stingy regular expression matching
Submitted on Monday, 3 April, 2006 - 20:18
Hi,
I'm trying to clean up a document that was automatically converted to LaTeX. The document contains strings like these:
"(\textit{sahaprat}\textit{ī}\textit{tiniyama}). [Reply:] Not even for the piece of ground and the pot is it like this! However, the fact that, if both exist, there is no restriction of the cognition to [only] one image (\textit{ekar}\textit{ū}\textit{papratī}\textit{tiniyamaviraha}),"
(never mind the meaning)
The idea is to get rid of all the unnecessary instances of \textit{}. So I thought I would search for instances of \textit{} that immediately follow one another and merge each such pair into one.
I entered this in the (regexp) search box: \\textit\{(.*?)\}\\textit\{(.*?)\}
and tried replacing it with: \\textit\{$1$2\}
The problem is that the stingy (non-greedy) operator ("?") is not stingy enough: apparently the search function looks ahead for the second instance of italicization *somewhere* later in the document; it doesn't restrict itself to instances of *contiguous* italics.
Thus, the regexp also finds:
\textit{some italicized stuff} Oh and here is lots of writing in between! This should not be part of what is found, but it is! \textit{next italicized stuff}.
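The behaviour can be reproduced outside the editor. The following is only a minimal sketch using Python's `re` module, which is an assumption on my part: the editor's regexp engine may differ in details. The sample line and the names `LAZY_PAIR` and `sample` are made up for illustration.

```python
import re

# The pattern as typed into the search box, with lazy groups.
LAZY_PAIR = re.compile(r"\\textit\{(.*?)\}\\textit\{(.*?)\}")

# Hypothetical sample line: one lone \textit{}, then a contiguous pair.
sample = r"\textit{one} plain text in between \textit{two}\textit{three}"

match = LAZY_PAIR.search(sample)
first_group = match.group(1)   # what the first (.*?) ended up capturing
second_group = match.group(2)  # what the second (.*?) ended up capturing

print(repr(first_group))
print(repr(second_group))
```

Here the first group swallows "one} plain text in between \textit{two", because "." is allowed to match "}" and the lazy group simply keeps growing until the literal "}\textit{" can follow it. The laziness only picks the shortest match from a given starting point; it does not forbid crossing a closing brace.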
Somehow I can't bring myself to believe that this is how the regexp should be understood. Or am I barking up the wrong tree?
I also tried a positive lookahead, \\textit\{(.*?)\}(?=\\textit\{(.*?)\}), but this produced the same unwanted results.
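One alternative that might be worth trying is a negated character class in place of ".*?", so that a group can never run past a brace. The sketch below uses Python's `re` module purely as an illustration (an assumption, not the editor's own engine), and the helper name `merge_adjacent_textit` is hypothetical. It assumes the italicized text contains no nested braces.

```python
import re

# [^{}]* matches any run of characters except braces, so a group
# cannot extend beyond the closing brace of its own \textit{...}.
PAIR = re.compile(r"\\textit\{([^{}]*)\}\\textit\{([^{}]*)\}")


def merge_adjacent_textit(text):
    """Merge directly adjacent \\textit{...} groups until none remain."""
    previous = None
    while previous != text:
        previous = text
        # A function replacement avoids escaping issues in the template.
        text = PAIR.sub(
            lambda m: r"\textit{" + m.group(1) + m.group(2) + "}", text
        )
    return text


print(merge_adjacent_textit(r"(\textit{sahaprat}\textit{ī}\textit{tiniyama})."))
```

The loop is there because a single pass only merges non-overlapping pairs, so a run of three adjacent groups needs a second pass. In a search box the equivalent would presumably be to repeat "replace all" until nothing more is found.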
Any advice would be greatly appreciated.
Thanks in advance,
Best regards,