Extract HTML and images from document

Get help using and writing Nisus Writer Pro macros.
capvideo
Posts: 21
Joined: 2008-03-16 16:41:16

Post by capvideo »

I was playing around with Nisus and Perl this weekend and managed to write a script that exports clean HTML (with some table help from Martin, after I submitted a request for a feature that already exists).

One of the examples also extracts images from a document, which is probably not very useful but does show off how information can be extracted from RTF.
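
For readers who have not looked inside RTF, here is a minimal sketch of the idea in Python (not the Perl script from the article). It assumes the images are stored as standard hex-encoded `\pict` groups tagged with `\pngblip` or `\jpegblip`. The function name `extract_images` is made up for illustration, and `\pict` groups that contain nested groups are simply skipped.

```python
import re

# Standard RTF picture-type control words mapped to file extensions.
BLIP_EXTENSIONS = {"pngblip": "png", "jpegblip": "jpg"}

# A \pict group with no nested groups inside it (nested ones are skipped).
PICT_RE = re.compile(r"\{\\pict\b([^{}]*)\}")

# A control word: backslash, letters, optional numeric parameter, optional space.
CONTROL_WORD_RE = re.compile(r"\\([a-z]+)-?\d* ?")


def extract_images(rtf):
    """Return a list of (extension, bytes) for each hex-encoded \\pict group."""
    images = []
    for match in PICT_RE.finditer(rtf):
        body = match.group(1)
        words = CONTROL_WORD_RE.findall(body)
        ext = next((BLIP_EXTENSIONS[w] for w in words if w in BLIP_EXTENSIONS), "bin")
        # What remains after dropping control words and whitespace is the hex payload.
        hex_data = re.sub(r"\s+", "", CONTROL_WORD_RE.sub("", body))
        if hex_data and len(hex_data) % 2 == 0 and re.fullmatch(r"[0-9a-fA-F]+", hex_data):
            images.append((ext, bytes.fromhex(hex_data)))
    return images
```

Each returned pair can then be written to disk with a name of your choosing, for example `image1.png`.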

I'm posting here in case anyone finds it useful or wants to comment on how it could be improved:

http://www.hoboes.com/Mimsy/?ART=643
martin
Official Nisus Person
Posts: 5227
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Post by martin »

Wow, that's quite a macro you've put together, impressive!

NWP's Save As HTML comes to us "for free" courtesy of Apple. However, the resulting fidelity relies on their RTF import capabilities (and HTML export of course), so may not be perfect. At some point we may use a different library to accomplish HTML export, but that time may be far off.

Also, in response to one comment in your article: we are aware that it would be useful for macros to have richer access to formatting information, e.g. applied style names. I believe you recently submitted a request for such a feature, so thank you for that. Munging RTF is never fun.
Nobumi Iyanaga
Posts: 158
Joined: 2007-01-17 05:46:17
Location: Tokyo, Japan

Post by Nobumi Iyanaga »

Hello capvideo,

Although I have not yet tried or understood your scripts, I am very impressed by them. They seem really very good. I will try them, and try to study them.

You write in your web page:
capvideo wrote: it’d be a whole lot nicer (and likely more reliable) if I could get the style names directly from Nisus instead of having to parse them out of the RTF.
I think this is possible now with the new macro language in NWP 1.1 public beta 1: there is an "attributes" object that gives you the following properties:

fontSize
fontName
fontFamilyName
postScriptFontName
bold
italic
underline
strikethrough
superscript
characterCase
characterStyleName
paragraphStyleName
languageCode

I think this will be quite useful for your script.
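
As a rough illustration of how such properties could drive HTML output, here is a small Python sketch. It is not Nisus macro code: it assumes the attribute values have already been read out into a plain dict keyed by the property names listed above, and the function name `run_to_html` is hypothetical.

```python
from html import escape


def run_to_html(text, attrs):
    """Wrap one run of text in inline tags based on a dict of attribute values."""
    html = escape(text)
    if attrs.get("italic"):
        html = "<em>%s</em>" % html
    if attrs.get("bold"):
        html = "<strong>%s</strong>" % html
    # Carry a named character style over as a CSS class (an assumed convention).
    style = attrs.get("characterStyleName")
    if style:
        html = '<span class="%s">%s</span>' % (escape(style, quote=True), html)
    return html
```

For example, `run_to_html("a & b", {"bold": True})` gives `<strong>a &amp; b</strong>`.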

I developed a system to convert NWP documents (and plain Cocoa RTF documents) to TeX (see http://www.bekkoame.ne.jp/~n-iyanag/res ... LaTeX.html); it is quite a laborious system with many problems. I think I should rewrite the code for NWP to use the new macro features. I will try to learn from your code when I do this work.
Best regards,

Nobumi Iyanaga
Tokyo,
Japan
martin
Official Nisus Person
Posts: 5227
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Post by martin »

Yes, the new macro Attributes object should prove useful. It is not yet complete, however: there is still formatting information that needs to be made available. If there's any particular piece of information you find more critical than another, let us know so we can prioritize it for future enhancements.
capvideo
Posts: 21
Joined: 2008-03-16 16:41:16

Post by capvideo »

Nobumi and martin, yes, it looks pretty useful. I'll be taking a look at the .characterStyleName and the .paragraphStyleName attributes mainly. I'll definitely upload the new version when I do this.

I think this leaves only images and the navigator level to be pulled from the RTF. (By navigator level, I mean the table of contents level as it shows in the navigator, which determines whether a paragraph-level block should be h1, h2, etc.)

So if there were an attribute or means of getting the navigator level of a particular paragraph, that would be useful.
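
To make the request concrete, here is a minimal Python sketch of what the export would do with such a value. It assumes a hypothetical integer level (1 for top-level entries, `None` for paragraphs not in the table of contents). Neither `block_tag` nor `render_paragraph` is an existing Nisus or Perl function.

```python
from html import escape


def block_tag(toc_level):
    """Pick an HTML tag for a paragraph given its (hypothetical) TOC level."""
    if toc_level is None or toc_level < 1:
        return "p"
    return "h%d" % min(toc_level, 6)  # HTML only defines h1 through h6


def render_paragraph(text, toc_level=None):
    """Wrap escaped paragraph text in the tag chosen by block_tag."""
    tag = block_tag(toc_level)
    return "<%s>%s</%s>" % (tag, escape(text), tag)
```

With this, `render_paragraph("Intro", 1)` produces `<h1>Intro</h1>`, and a paragraph with no level falls back to `<p>`.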

Jerry
Nobumi Iyanaga
Posts: 158
Joined: 2007-01-17 05:46:17
Location: Tokyo, Japan

Post by Nobumi Iyanaga »

martin wrote: Yes, the new macro Attributes object should prove useful. It is not yet complete, however: there is still formatting information that needs to be made available. If there's any particular piece of information you find more critical than another, let us know so we can prioritize it for future enhancements.
The paragraph style name and character style name are certainly important, but the contents of those styles are more important... It would not be easy to get them all, though. For my personal use, the indents of paragraph styles are the most critical, because they change the whole look of a paragraph.
Best regards,

Nobumi Iyanaga
Tokyo,
Japan
Nobumi Iyanaga
Posts: 158
Joined: 2007-01-17 05:46:17
Location: Tokyo, Japan

Post by Nobumi Iyanaga »

Hello Jerry,

Regarding your script, I think it is still necessary to read the raw RTF code, for instance to grab images. But with the new macro language it would be much easier, or faster, to do that. You would do:

Code:

Find All '^.+', 'E'
$doc = Document.active
$paragraphs = Array.new
$paragraphs = $doc.selectedSubtexts

ForEach $paragraph in $paragraphs
	$rtf = Encode RTF $paragraph
	# prompt $rtf or any other commands...
End
It is much faster to work with variables (here, "$paragraphs" and "$paragraph") than to loop over the document this way:

Code:

While Select Next Paragraph
	$currentParagraph = Read Selection
....
Best regards,

Nobumi Iyanaga
Tokyo,
Japan
Nobumi Iyanaga
Posts: 158
Joined: 2007-01-17 05:46:17
Location: Tokyo, Japan

Post by Nobumi Iyanaga »

Hello,

Regarding the "Find All '^.+', 'E'" approach from my previous post: I found that if your file contains notes, it does not work as expected.

I made a document like the following:

aiueo[footnote1] sashisuseso[endnote1]
naninuneno[footnote2]

----
1 kakikukeko
2 hahifuheho

-------
1 tachitsuteto

With the expression ".+" and Find All, the following is highlighted:

aiueo sashisuseso[endnote1]
naninuneno

and

kakikukeko
hahifuheho

-------
tachitsuteto

Thus ".+" matches no footnote references, but it does match an endnote reference.

If I use "^.+", it matches only:

aiueo
naninuneno

And if I use "^.+$", it matches nothing at all...

This last behavior is probably understandable: if "^.+$" must match a whole paragraph, and note references are treated as "non-characters", then no paragraph can match, because every paragraph here contains a note reference. But if note references are "non-characters", why can ".+" match an endnote reference?

Anyway, it is very inconvenient that note references cannot be selected with "."...

So, for the time being, I think we should use the code

Code:

While Select Next Paragraph 
   $currentParagraph = Read Selection 
   ....
End
although it is certainly much slower than the other code.
Best regards,

Nobumi Iyanaga
Tokyo,
Japan