Extract HTML and images from document 

Joined: 2008-03-16 16:41:16
Posts: 20
I was playing around with Nisus and Perl this weekend and managed to create an export to clean HTML script (with some table help from Martin after I submitted a request for a feature that already exists).

One of the examples also extracts images from a document, which is probably not very useful but does show off how information can be extracted from RTF.
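As a rough illustration of the idea (a minimal Python sketch, not the Perl script from the linked article): RTF commonly stores an embedded picture as a `{\pict ...}` group whose image bytes are written out in hexadecimal. The sketch assumes that layout, ignores nested groups and binary (`\bin`) data, and the name `extract_images` is made up for this example. Nisus documents may store images differently.

```python
import re

# Control words that name the image format inside an RTF \pict group.
PICT_FORMATS = {"pngblip": "png", "jpegblip": "jpg"}

# One {\pict ...} group: a run of control words, then the hex payload.
# Assumption: no nested groups and no \bin data inside the picture group.
PICT_RE = re.compile(r"\{\\pict((?:\\[a-z]+-?\d*\s?)+)([0-9a-fA-F\s]+)\}")


def extract_images(rtf):
    """Return a list of (extension, bytes) for each hex-encoded \\pict group."""
    images = []
    for match in PICT_RE.finditer(rtf):
        controls, hex_data = match.groups()
        words = re.findall(r"\\([a-z]+)", controls)
        # Fall back to a generic extension for formats not listed above.
        ext = next((PICT_FORMATS[w] for w in words if w in PICT_FORMATS), "bin")
        # Hex data may be wrapped across lines, so drop all whitespace first.
        images.append((ext, bytes.fromhex("".join(hex_data.split()))))
    return images
```

Each returned pair could then be written to a file such as `image1.png`; the file naming is left out of the sketch.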

I'm posting here in case anyone finds it useful or wants to comment on how it could be improved:

http://www.hoboes.com/Mimsy/?ART=643

_________________
Mimsy Were the Borogoves


2008-03-16 17:01:30
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
Wow, that's quite a macro you've put together, impressive!

NWP's Save As HTML comes to us "for free" courtesy of Apple. However, the resulting fidelity relies on their RTF import capabilities (and HTML export of course), so it may not be perfect. At some point we may use a different library to accomplish HTML export, but that time may be far off.

Also, in response to one comment in your article: we are aware that it would be useful for macros to have richer access to formatting information, e.g. applied style names. I believe you recently submitted a request for such a feature, so thank you for that. Munging RTF is never fun.


2008-03-17 16:14:22

Joined: 2007-01-17 05:46:17
Posts: 145
Location: Tokyo, Japan
Hello capvideo,

Although I have not yet tried or fully understood your scripts, I am very impressed by them. They look very good. I will try them out and study them.

You write in your web page:
capvideo wrote:
it’d be a whole lot nicer (and likely more reliable) if I could get the style names directly from Nisus instead of having to parse them out of the RTF.


I think this is possible now with the new macro language for NWP 1.1 public beta 1: there is an "attributes" object, from which you can get the following properties:

fontSize
fontName
fontFamilyName
postScriptFontName
bold
italic
underline
strikethrough
superscript
characterCase
characterStyleName
paragraphStyleName
languageCode

I think this will be quite useful for your script.
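To make the idea concrete, here is a minimal Python sketch of how attribute values like those listed above could drive the choice of HTML tags. It is not Nisus macro code: the attributes are modelled as a plain dict keyed by the property names in the list, and the style names in the two tables ("Heading 1", "Emphatic", and so on) are hypothetical.

```python
from html import escape

# Hypothetical style names; a real document's style sheet will differ.
PARAGRAPH_TAGS = {"Heading 1": "h1", "Heading 2": "h2", "Block Quote": "blockquote"}
CHARACTER_TAGS = {"Emphatic": "em", "Strong": "strong", "Code": "code"}


def run_to_html(text, attrs):
    """Wrap one run of text in a tag chosen from its attribute values."""
    html = escape(text)
    # Prefer a named character style; fall back to raw italic/bold flags.
    tag = CHARACTER_TAGS.get(attrs.get("characterStyleName"))
    if tag is None and attrs.get("italic"):
        tag = "em"
    if tag is None and attrs.get("bold"):
        tag = "strong"
    return f"<{tag}>{html}</{tag}>" if tag else html


def paragraph_to_html(runs, paragraph_style_name):
    """Join (text, attrs) runs and wrap them in the paragraph-level tag."""
    tag = PARAGRAPH_TAGS.get(paragraph_style_name, "p")
    body = "".join(run_to_html(text, attrs) for text, attrs in runs)
    return f"<{tag}>{body}</{tag}>"
```

For example, `paragraph_to_html([("Tom & ", {}), ("Jerry", {"bold": True})], "Heading 1")` gives `<h1>Tom &amp; <strong>Jerry</strong></h1>`, and any paragraph style missing from the table falls back to `<p>`.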

I developed a system to convert NWP documents (and plain Cocoa RTF documents) to TeX (see http://www.bekkoame.ne.jp/~n-iyanag/res ... LaTeX.html); it is quite a laborious system, with many problems. I think I should rewrite the code for NWP to use the new macro functionality. I will try to learn from your code to do this work.

_________________
Best regards,

Nobumi Iyanaga
Tokyo,
Japan


2008-04-18 00:43:26
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
Yes, the new macro Attributes object should prove useful. It is, however, not complete: there is still formatting information that needs to be made available. If there's any particular piece of information you find more critical than another, let us know so we can prioritize it for future enhancements.


2008-04-18 12:58:47

Joined: 2008-03-16 16:41:16
Posts: 20
Nobumi and martin, yes, it looks pretty useful. I'll be taking a look at the .characterStyleName and the .paragraphStyleName attributes mainly. I'll definitely upload the new version when I do this.

I think this leaves only images and the navigator level that have to be pulled from the RTF. (By navigator level, I mean the table of contents level as it shows in the navigator, which determines whether a paragraph-level block should be h1, h2, etc.)

So if there were an attribute or means of getting the navigator level of a particular paragraph, that would be useful.
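Here is a small Python sketch of what the export side could do with such a value. The `navigator_level` argument is hypothetical (no such attribute is confirmed in this thread); the sketch assumes it is 1 for a top-level heading and `None` for body text.

```python
def block_tag(navigator_level):
    """Map a table-of-contents level to an HTML tag name.

    Assumption: level 1 is the top-level heading and None (or anything
    below 1) means the paragraph is ordinary body text.
    """
    if navigator_level is None or navigator_level < 1:
        return "p"
    # HTML only defines h1 through h6, so deeper levels are clamped.
    return f"h{min(navigator_level, 6)}"
```

Clamping at `h6` is a design choice for the sketch; deeper levels could equally be emitted as styled paragraphs.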

Jerry

_________________
Mimsy Were the Borogoves


2008-04-18 13:54:54

Joined: 2007-01-17 05:46:17
Posts: 145
Location: Tokyo, Japan
martin wrote:
Yes, the new macro Attributes object should prove useful. It is, however, not complete: there is still formatting information that needs to be made available. If there's any particular piece of information you find more critical than another, let us know so we can prioritize it for future enhancements.


The paragraph style name and character style name are certainly important, but their contents are more important... But it would not be easy to get them all. For my personal use, the indents of paragraph styles are the most critical, because they change the whole look of a paragraph.

_________________
Best regards,

Nobumi Iyanaga
Tokyo,
Japan


2008-04-18 16:03:18

Joined: 2007-01-17 05:46:17
Posts: 145
Location: Tokyo, Japan
Hello Jerry,

Related to your script, I think it is still necessary to read the raw RTF code, for instance to grab images. But with the new macro language, it would be much easier, or at least faster, to do that. You would do:
Code:
Find All '^.+', 'E'
$doc = Document.active
$paragraphs = Array.new
$paragraphs = $doc.selectedSubtexts

ForEach $paragraph in $paragraphs
   $rtf = Encode RTF $paragraph
   # prompt $rtf or any other commands...
End


It is much faster to work with variables (here, "$paragraphs" and "$paragraph") than with the following approach:
Code:
While Select Next Paragraph
   $currentParagraph = Read Selection
....

_________________
Best regards,

Nobumi Iyanaga
Tokyo,
Japan


2008-04-20 20:41:49

Joined: 2007-01-17 05:46:17
Posts: 145
Location: Tokyo, Japan
Hello,

I wrote:

Nobumi Iyanaga wrote:
Related to your script, I think it is still necessary to read the raw RTF code, for instance to grab images. But with the new macro language, it would be much easier, or at least faster, to do that. You would do:
Code:
Find All '^.+', 'E'
$doc = Document.active
$paragraphs = Array.new
$paragraphs = $doc.selectedSubtexts

ForEach $paragraph in $paragraphs
   $rtf = Encode RTF $paragraph
   # prompt $rtf or any other commands...
End


It is much faster to work with variables (here, "$paragraphs" and "$paragraph") than with the following approach:
Code:
While Select Next Paragraph
   $currentParagraph = Read Selection
....


Regarding this "Find All '^.+', 'E'", I found that if you have notes in your file, it does not work as expected.

I made a document like the following:

aiueo[footnote1] sashisuseso[endnote1]
naninuneno[footnote2]

----
1 kakikukeko
2 hahifuheho

-------
1 tachitsuteto

With the wildcard expression ".+", and Find All, I highlight:

aiueo sashisuseso[endnote1]
naninuneno

and

kakikukeko
hahifuheho

-------
tachitsuteto

Thus, ".+" matches no footnote references, but it does match an endnote reference.

If I use "^.+", it matches only:

aiueo
naninuneno

And if I use "^.+$", it matches nothing at all...

This last behavior is probably understandable: if "^.+$" must match a whole paragraph, and note references are treated as "non-characters", then there is no paragraph without a note reference for it to match. But if note references are "non-characters", why can ".+" match an endnote reference?

Anyway, it is very inconvenient that note references cannot be selected with "."...
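The "non-character" hypothesis can be checked outside Nisus with a small Python sketch. This is only a model: it assumes each note reference is a single placeholder character (U+FFFC is used here as a stand-in) that "." refuses to match, and it says nothing about how Nisus actually implements Find All. The helper name `find_all` is made up for this example.

```python
import re

NOTE = "\ufffc"            # assumed stand-in for one note reference
DOT = "[^\n" + NOTE + "]"  # a "." that refuses to match note references


def find_all(pattern, text):
    """Run a multi-line search with "." replaced by the restricted DOT."""
    return re.findall(pattern.replace(".", DOT), text, flags=re.MULTILINE)


# The body text of the test document, with both kinds of note reference
# modelled by the same placeholder character.
body = f"aiueo{NOTE} sashisuseso{NOTE}\nnaninuneno{NOTE}"

print(find_all(".+", body))    # ['aiueo', ' sashisuseso', 'naninuneno']
print(find_all("^.+", body))   # ['aiueo', 'naninuneno']
print(find_all("^.+$", body))  # []
```

Under this model, "^.+" and "^.+$" behave exactly as observed above, but ".+" would also skip the endnote reference, so the hypothesis alone does not explain why the endnote reference gets highlighted.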

So, for the time being, I think we should use the code
Code:
While Select Next Paragraph
   $currentParagraph = Read Selection
   ....
End

although it is certainly much slower than the other code.

_________________
Best regards,

Nobumi Iyanaga
Tokyo,
Japan


2008-04-27 15:40:13