jEdit Community - Resources for users of the jEdit Text Editor
HTML Cleanup (delete tags and contents)
Submitted by modestmoose on Wednesday, 7 April, 2010 - 20:07
I am working on a project at my law school. I receive news articles in .docx format and then convert them to .html. This requires a lot of cleanup of the extra tags Word puts into the text. I have a very simple macro that does a find and replace to delete common unnecessary tags. What I want to do is delete the entire (head) tag and its contents (and likewise other tags, such as table tags).

I have no programming background and can't figure it out. How would I have a macro search for (head) and (/head) and then delete the tags and everything in between?

Thank you ahead of time for any insights you can give me,

Brian
Regular Expressions
by Robert Schwenn on Thu, 08/04/2010 - 19:42
You should record a macro once you have tested the search manually and confirmed that it works, with settings like these:
- Regular Expressions = ON
- search for "(head)(.*\n)*.*(/head)"
- replace with ""

Of course, you have to replace the parentheses around "head" with real angle brackets.
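For readers who want to try the pattern outside the search dialog, here is a minimal Java sketch of the same search and replace. It assumes the regex behaves as it does in java.util.regex; the class name HeadStripper, the method stripHead, and the sample document are made up for illustration and are not part of jEdit.

```java
import java.util.regex.Pattern;

class HeadStripper {
    // The pattern from the post, with real angle brackets substituted.
    // (.*\n)* consumes whole lines; the final .* covers text on the last line
    // that precedes the closing tag.
    private static final Pattern HEAD =
            Pattern.compile("<head>(.*\\n)*.*</head>");

    // Returns the input with the <head> element and everything inside it removed.
    static String stripHead(String html) {
        return HEAD.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample document.
        String html = "<html>\n<head>\n<title>News</title>\n</head>\n"
                + "<body>Text</body>\n</html>\n";
        System.out.println(stripHead(html));
    }
}
```

This only handles a bare head tag with no attributes, exactly as in the pattern above.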

Robert
 
Hmmm
by modestmoose on Fri, 09/04/2010 - 16:26
I've found that "(head)(.*\n)*.*(/head)" works great for the (head) tags because there is only one (head) tag in an html document.

But, when I replaced (head) with (table) and (/head) with (/table) I realized that "(table)(.*\n)*.*(/table)" searches for the last (/table) tag in the entire document.

What I would like to do is have it search for the next (/table) tag that appears in the document, not the last one. Can I do this easily?
 
Stingy (Minimal) Matching
by Robert Schwenn on Fri, 09/04/2010 - 19:39
Search the docs for "Stingy (Minimal) Matching"; it is covered at the bottom of this page:
http://www.jedit.org/users-guide/regexps.html
 
It works!!!
by modestmoose on Sat, 10/04/2010 - 20:01
So, I ended up messing around with it and I got this to work: "(table(.*\n)*?(/table)". I took the closing bracket off of (table) because tables in HTML usually have extra attributes inside the tag. The ? effectively stops it at the first instance of (/table). It seems to work, but let me know if this is NOT the right way to do it. Thanks everyone.
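A hedged note on the pattern above, with a small Java sketch to compare the variants under java.util.regex rules. As posted, the pattern only stops at a closing tag that sits at the very start of a line, because the trailing ".*" from the original suggestion was dropped. Adding a reluctant ".*?" before the closing tag also covers a closing tag in the middle of a line. The class name TableStripper, the field names, and the sample text are invented for this illustration.

```java
import java.util.regex.Pattern;

class TableStripper {
    // Greedy: runs from the first <table to the LAST </table> in the text.
    static final Pattern GREEDY =
            Pattern.compile("<table(.*\\n)*.*</table>");

    // As posted in the thread: stops at the next </table> only if that tag
    // begins a line.
    static final Pattern POSTED =
            Pattern.compile("<table(.*\\n)*?</table>");

    // Reluctant on both parts: stops at the NEXT </table>, wherever it sits.
    static final Pattern RELUCTANT =
            Pattern.compile("<table(.*\\n)*?.*?</table>");

    // Deletes every match of the given pattern from the text.
    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample: the first closing tag is NOT at the start of a line.
        String html = "<table border=\"1\">\n<tr><td>a</td></tr></table>\n"
                + "<p>keep me</p>\n"
                + "<table>\n<tr><td>b</td></tr>\n</table>\n";
        System.out.println("greedy:    [" + strip(GREEDY, html) + "]");
        System.out.println("posted:    [" + strip(POSTED, html) + "]");
        System.out.println("reluctant: [" + strip(RELUCTANT, html) + "]");
    }
}
```

With this sample, only the reluctant variant keeps the paragraph between the two tables. If every closing table tag in the converted files starts its own line, the posted pattern gives the same result.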
 
Thanks everybody.
by modestmoose on Sat, 10/04/2010 - 20:34
 
Perfect
by modestmoose on Fri, 09/04/2010 - 16:05
Robert, you are a lifesaver, thank you. What does (.*\n)*.* do exactly? Anyway, it works great. You saved me having to do this manually 5 times a week. I appreciate it.

Brian
 
".*" means: any number of any
by Robert Schwenn on Fri, 09/04/2010 - 19:36
".*" means: any number of any characters. But since "." in jEdit's regular expressions does not match a newline, I used a workaround:
"(.*\n)*": any number of: "any number of any characters plus a new line"
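The behavior described here can be checked with a short Java sketch, assuming java.util.regex semantics. It also shows an alternative to the workaround: the embedded flag (?s) (DOTALL mode) lets "." match newlines too. Whether the jEdit search dialog accepts (?s) depends on the jEdit version, so treat that part as an assumption to test first. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DotAndNewline {
    // True if the regex matches the ENTIRE text.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // Alternative to the (.*\n)* workaround: (?s) makes '.' match '\n' as well,
    // and the reluctant .*? stops at the first closing tag.
    static String stripHeadDotall(String html) {
        return Pattern.compile("(?s)<head>.*?</head>").matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch(".*", "one line"));            // true
        System.out.println(wholeMatch(".*", "two\nlines"));          // false: '.' stops at '\n'
        System.out.println(wholeMatch("(.*\\n)*.*", "two\nlines"));  // true: the workaround
        System.out.println(wholeMatch("(?s).*", "two\nlines"));      // true: DOTALL mode
        System.out.println(stripHeadDotall("<head>\n<title>x</title>\n</head>\n<body/>"));
    }
}
```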

Regular expressions are very powerful, but not always easy ;-)

See:
http://www.jedit.org/users-guide/regexps.html
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
http://www.regular-expressions.info/tutorial.html