I was playing around with Nisus and Perl this weekend and managed to create a script that exports clean HTML (with some table help from Martin after I submitted a request for a feature that already exists).
One of the examples also extracts images from a document, which is probably not very useful but does show off how information can be extracted from RTF.
I'm posting here in case anyone finds it useful or wants to comment on how it could be improved:
http://www.hoboes.com/Mimsy/?ART=643
Extract HTML and images from document
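For readers wondering what pulling images out of RTF can look like, here is a minimal Python sketch (it is not the Perl from the linked article). It assumes the pictures are stored inline as hex-encoded {\pict ...} groups with no nested groups inside them, which is not guaranteed for every RTF producer. The names extract_images, PICT_RE and BLIP_TYPES are invented for this illustration.

```python
import binascii
import re

# Assumption: each picture is a flat {\pict ...} group (no nested braces)
# whose payload is hexadecimal text following a few control words.
PICT_RE = re.compile(r"\{\\pict\b([^{}]*)\}")

# Control words that identify the image format inside a \pict group.
BLIP_TYPES = {"pngblip": "png", "jpegblip": "jpg"}


def extract_images(rtf):
    """Return a list of (extension, raw_bytes) for inline hex pictures."""
    images = []
    for match in PICT_RE.finditer(rtf):
        body = match.group(1)
        # Collect control words such as \pngblip or \picw100.
        words = re.findall(r"\\([a-z]+)(-?\d+)?", body)
        ext = next((BLIP_TYPES[w] for w, _ in words if w in BLIP_TYPES), "bin")
        # Drop the control words, then any whitespace, leaving only hex digits.
        hex_data = re.sub(r"\\[a-z]+(-?\d+)? ?", "", body)
        hex_data = re.sub(r"\s+", "", hex_data)
        images.append((ext, binascii.unhexlify(hex_data)))
    return images


sample = r"{\rtf1 text {\pict\pngblip\picw1\pich1 89504e47} more}"
print(extract_images(sample))  # [('png', b'\x89PNG')]
```

A real document would need a proper group-aware parser, since \pict groups often carry nested property groups that this pattern skips.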
- martin
- Official Nisus Person
- Posts: 5227
- Joined: 2002-07-11 17:14:10
- Location: San Diego, CA
Wow, that's quite a macro you've put together, impressive!
NWP's Save As HTML comes to us "for free" courtesy of Apple. However, the resulting fidelity relies on their RTF import capabilities (and HTML export of course), so may not be perfect. At some point we may use a different library to accomplish HTML export, but that time may be far off.
Also, in response to one comment in your article: we are aware that it would be useful for macros to have richer access to formatting information, eg: applied style names. I believe you recently submitted a request for such a feature, so thank you for that. Munging RTF is never fun.
- Nobumi Iyanaga
- Posts: 158
- Joined: 2007-01-17 05:46:17
- Location: Tokyo, Japan
Hello capvideo,
Although I have not yet tried or understood your scripts, I am very impressed by them. They seem really very good. I will try them, and try to study them.
You write in your web page:
capvideo wrote:
it’d be a whole lot nicer (and likely more reliable) if I could get the style names directly from Nisus instead of having to parse them out of the RTF.
I think this is possible now with the new macro language for NWP 1.1 public beta 1: there is an "attributes" object, with which you will be able to get the following properties:
fontSize
fontName
fontFamilyName
postScriptFontName
bold
italic
underline
strikethrough
superscript
characterCase
characterStyleName
paragraphStyleName
languageCode
I think this will be quite useful for your script.
I developed a system to convert NWP documents (and plain Cocoa RTF documents) to TeX (see http://www.bekkoame.ne.jp/~n-iyanag/res ... LaTeX.html); it is quite a laborious system with many problems. I think I should rewrite the code for NWP to use the new macro functionality. I will try to learn from your code to do this work.
Best regards,
Nobumi Iyanaga
Tokyo,
Japan
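As a rough illustration of how properties like these could drive an HTML export, here is a small Python sketch. RunAttributes is a hypothetical stand-in that borrows three property names from the list above, and the STYLE_TAGS mapping and style names are invented; none of this is the Nisus macro API itself.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class RunAttributes:
    """Hypothetical stand-in for a few of the listed properties."""
    bold: bool = False
    italic: bool = False
    characterStyleName: str = ""


# Invented mapping from character style names to HTML elements.
STYLE_TAGS = {"Emphasis": "em", "Strong": "strong", "Code": "code"}


def run_to_html(text, attrs):
    """Wrap one run of text in tags chosen from its attributes."""
    html = escape(text)
    tag = STYLE_TAGS.get(attrs.characterStyleName)
    if tag:
        # A named style wins over raw bold/italic flags.
        return f"<{tag}>{html}</{tag}>"
    if attrs.italic:
        html = f"<i>{html}</i>"
    if attrs.bold:
        html = f"<b>{html}</b>"
    return html


print(run_to_html("hi", RunAttributes(bold=True, italic=True)))  # <b><i>hi</i></b>
print(run_to_html("x < y", RunAttributes(characterStyleName="Code")))  # <code>x &lt; y</code>
```

Preferring the style name over the bold and italic flags is a design choice made here for cleaner markup; an exporter could just as well combine them.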
- martin
- Official Nisus Person
- Posts: 5227
- Joined: 2002-07-11 17:14:10
- Location: San Diego, CA
Yes, the new macro Attributes object should prove useful. It is, however, not complete: there is still formatting information that needs to be made available. If there's any particular piece of information you find more critical than another, let us know so we can prioritize it for future enhancements.
Nobumi and martin, yes, it looks pretty useful. I'll be taking a look at the .characterStyleName and the .paragraphStyleName attributes mainly. I'll definitely upload the new version when I do this.
I think this leaves only images and navigator level that have to be pulled from the RTF. (By navigator level, I mean the table of contents level as it shows in the navigator, to get whether a paragraph-level block should be h1, h2, etc.)
So if there were an attribute or means of getting the navigator level of a particular paragraph, that would be useful.
Jerry
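To make the navigator-level idea concrete, here is a minimal Python sketch of turning a table of contents level into a heading tag. The level value is assumed to be available somehow as a 1-based integer (or None for body text); heading_tag and render_paragraph are hypothetical names, not part of the script or of Nisus.

```python
from html import escape


def heading_tag(level):
    """Map a 1-based table-of-contents level to an HTML tag name."""
    if level is None or level < 1:
        return "p"  # ordinary paragraph, not in the navigator
    return f"h{min(level, 6)}"  # HTML only defines h1 through h6


def render_paragraph(text, level=None):
    """Wrap escaped paragraph text in the tag chosen for its level."""
    tag = heading_tag(level)
    return f"<{tag}>{escape(text)}</{tag}>"


print(render_paragraph("Introduction", 1))  # <h1>Introduction</h1>
print(render_paragraph("Body text"))  # <p>Body text</p>
```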
- Nobumi Iyanaga
- Posts: 158
- Joined: 2007-01-17 05:46:17
- Location: Tokyo, Japan
martin wrote:
Yes, the new macro Attributes object should prove useful. It is, however, not complete: there is still formatting information that needs to be made available. If there's any particular piece of information you find more critical than another, let us know so we can prioritize it for future enhancements.
The paragraph style name and character style name are certainly important, but their contents are more important... It would not be easy to get them all, though. For my personal use, the indents of paragraph styles are the most critical, because they change the whole look of a paragraph.
Best regards,
Nobumi Iyanaga
Tokyo,
Japan
- Nobumi Iyanaga
- Posts: 158
- Joined: 2007-01-17 05:46:17
- Location: Tokyo, Japan
Hello Jerry,
Related to your script, I think it is still necessary to read the raw RTF code, for instance for grabbing images. But with the new macro language, that would be much easier, or at least faster, to do. You would do:
Code: Select all
Find All '^.+', 'E'
$doc = Document.active
$paragraphs = Array.new
$paragraphs = $doc.selectedSubtexts
ForEach $paragraph in $paragraphs
$rtf = Encode RTF $paragraph
# prompt $rtf or any other commands...
End
It is much faster to work with variables (here, "$paragraphs" and "$paragraph") than to loop in this way:
Code: Select all
While Select Next Paragraph
$currentParagraph = Read Selection
....
Best regards,
Nobumi Iyanaga
Tokyo,
Japan
- Nobumi Iyanaga
- Posts: 158
- Joined: 2007-01-17 05:46:17
- Location: Tokyo, Japan
Hello,
I wrote:
Nobumi Iyanaga wrote:
Related to your script, I think it is still necessary to read the raw RTF code, for instance for grabbing images. But with the new macro language, that would be much easier, or at least faster, to do. You would do:
Code: Select all
Find All '^.+', 'E'
$doc = Document.active
$paragraphs = Array.new
$paragraphs = $doc.selectedSubtexts
ForEach $paragraph in $paragraphs
$rtf = Encode RTF $paragraph
# prompt $rtf or any other commands...
End
It is much faster to work with variables (here, "$paragraphs" and "$paragraph") than to loop in this way:
Code: Select all
While Select Next Paragraph
$currentParagraph = Read Selection
....
Related to this "Find All '^.+', 'E'", I found that if you have notes in your file, it does not work as expected.
I made a document like the following:
aiueo[footnote1] sashisuseso[endnote1]
naninuneno[footnote2]
----
1 kakikukeko
2 hahifuheho
-------
1 tachitsuteto
With the wildcard expression ".+" and Find All, I highlight:
aiueo sashisuseso[endnote1]
naninuneno
and
kakikukeko
hahifuheho
-------
tachitsuteto
Thus, ".+" finds no footnote references, but it does find an endnote reference.
If I use "^.+", it matches only:
aiueo
naninuneno
And if I use "^.+$", it matches nothing at all...
This last behavior is probably understandable: if "^.+$" must match whole paragraphs, and note references are considered "non-characters", then nothing matches because there is no paragraph without a note reference. But if note references are "non-characters", why can ".+" match an endnote reference...??
Anyway, it is very inconvenient that note references cannot be selected with "."...
So, for the time being, I think we should use the code
Code: Select all
While Select Next Paragraph
$currentParagraph = Read Selection
....
End
although it is certainly much slower than the other code.
Best regards,
Nobumi Iyanaga
Tokyo,
Japan
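The footnote part of this behavior can be modeled outside Nisus. The Python sketch below is only an analogy: it assumes a note reference acts like a single placeholder character that "." refuses to match, and uses U+FFFC as that placeholder (an assumption, not Nisus's real internal representation). It reproduces the "^.+" and "^.+$" results reported above, but it does not explain why an endnote reference was matched by ".+".

```python
import re

# Stand-in for a note reference; an assumption made for this model only.
REF = "\ufffc"

# The two body paragraphs of the test document, each containing a reference.
doc = f"aiueo{REF} sashisuseso{REF}\nnaninuneno{REF}"

# Behaves like "." would if note references were unmatchable.
ANY = f"[^{REF}\\n]"

matches_any = re.findall(f"{ANY}+", doc)
matches_bol = re.findall(f"^{ANY}+", doc, re.M)
matches_line = re.findall(f"^{ANY}+$", doc, re.M)

print(matches_any)   # ['aiueo', ' sashisuseso', 'naninuneno']
print(matches_bol)   # ['aiueo', 'naninuneno']
print(matches_line)  # []
```

In this model the anchored pattern "^.+$" fails simply because every paragraph contains at least one unmatchable character before its end.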