Macro for SwordSearcher to USFM 

Joined: 2011-01-12 05:32:38
Posts: 256
Hi,
I'm trying to automate conversion of Bible text files from SwordSearcher (SS) format to USFM (http://paratext.org/about/usfm; same in PDF: http://paratext.org/system/files/usfmReference2_35.pdf). I don't need to include all USFM markers, of course (that would be very involved, and SS uses almost none of those markups anyway). I want to do this because I can then output the USFM to .rtf files and have embedded footnotes and nice text I can easily manipulate in Nisus Writer Pro.

The basic format for SS has "$$", the book name abbreviation with chapter number, then a colon and the verse number. The "¶ " [pilcrow + space] indicates beginning of a paragraph. The data between curly brackets {} is footnote data. SS's format is very straightforward, but more information is available in the help file that comes with Forge (module builder software) for SwordSearcher, which can be downloaded here: http://www.swordsearcher.com/forge/index.html. Unfortunately, it's a Windows-only app.
Code:
$$ Ge 1:1
¶ In the beginning GOD created the heaven and the earth.
$$ Ge 1:2
And the earth was without form, and void; and darkness [was] upon the face of the deep. And the Spirit of GOD moved upon the face of the waters.
$$ Ge 1:3
And GOD said, Let there be light: and there was light.
$$ Ge 1:4
And GOD saw the light, that [it was] good: and GOD divided {the light from...: Heb. between the light and between the darkness}the light from the darkness.


The format for USFM is very different. Genesis 1:1-4 would look like the sample below. The last item in an opening marker is a space and the last item in a closing marker is an asterisk (\nd …\nd* marks names of deity; in the SS sample above these are presented merely in all caps, in USFM with these tags).

Code:
\id GEN
\c 1
\p
\v 1
In the beginning \nd God\nd* created the heaven and the earth.
\v 2
And the earth was without form, and void; and darkness [was] upon the face of the deep. And the Spirit of \nd God\nd* moved upon the face of the waters.
\v 3
And \nd God\nd* said, Let there be light: and there was light.
\v 4
And \nd God\nd* saw the light, that [it was] good: and \nd God\nd* divided \f + the light from...: Heb. between the light and between the darkness\f*the light from the darkness.


This 2nd USFM sample layout is better because it embeds in each footnote the chapter and verse (e.g., "1:4") to which the footnote refers (I don't know how to make a macro do this). It's marked up with the "\fr 1:4 \ft " tags and data.

Code:
\id GEN
\c 1
\p
\v 1
In the beginning \nd God\nd* created the heaven and the earth.
\v 2
And the earth was without form, and void; and darkness [was] upon the face of the deep. And the Spirit of \nd God\nd* moved upon the face of the waters.
\v 3
And \nd God\nd* said, Let there be light: and there was light.
\v 4
And \nd God\nd* saw the light, that [it was] good: and \nd God\nd* divided \f + \fr 1:4 \ft the light from...: Heb. between the light and between the darkness\f*the light from the darkness.


I am learning how to do macros, and I've successfully done a very basic one that can renumber the verses in one chapter. My problem is that I need a macro that can do an entire book of the Bible at a time and add the chapter markers (\c and the number) before verse 1 of each subsequent chapter. I don't know how to make that kind of macro. Here is my regex (PowerFind Pro) macro. It changes references without getting paragraphs right, and it marks up footnotes, but without putting the chapter and verse reference in the footnote as I would like (see 2nd USFM sample above).

Code:
Find and Replace '\\$\\$ [[:upper:]][[:lower:]]+ [[:digit:]]+:', '\\\\v ', 'Ea'

Find and Replace '{', '\\f + ', 'a'
Find and Replace '}', '\\f*', 'a'


Problems I'm having:
1. Getting the chapter number inserted properly for books with more than one chapter. I've attached a sample SS file for Romans 1:1-12:2 (unfortunately, this file was made before I started putting in pilcrows for beginnings of paragraphs, so if someone uses this, it would be good to insert some pilcrows randomly at the beginning of various verses to see if the \p marker is being converted correctly).
2. Getting the paragraph (prose) marker (\p ) on the line before the verse number in USFM (it follows a verse # in SS).
3. I would really like to get the reference of the verse into the footnote (as in 2nd USFM example above), so when one looks at a note at the bottom of the page (in NWP, after I export to .rtf), one can see immediately that footnote "a" is a comment on 1:4. To do this, "\fr 1:1 \ft " must be added to the footnote text. I don't know how to do that dynamically so the chapter and verse numbers are right.
4. It's not essential, but it'd be nice if the macro could convert the SS book names (e.g., "Ge") to the proper USFM book names (e.g., "\id GEN") at the top of the text. The abbreviations are in a .csv file attached.

I realize this is quite a project (at least to me), and I'll be grateful for any help. Thanks in advance!


Attachments:
File comment: Sample SS file (Rom 1:1-12:2) scrambled
Sample SS file of Romans 1_1thru12_2.rtf.zip [26.02 KiB]
Downloaded 229 times
File comment: SS and USFM abbreviations
SwordSearcher and USFM abbreviations.csv [553 Bytes]
Downloaded 262 times
2013-05-01 01:03:23

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Hello NisusUser,
This is certainly doable. Let me make one general comment, however. While regex is great, when things get this complicated you will definitely be better off taking an approach that first reads in the info (in your case, the chapter/verse numbers, etc., and even the text) and then prints it out again (in a new file) in the desired format. If nothing else this will make it much easier for the person maintaining the code (i.e., you) to follow what you are doing. Note that this will also free you to call on the info you need, e.g., the footnote reference in your case.

So the basic structure is going to be:
  1. Use a find all statement to read in the info. (This can also check the format for correctness).
  2. Store the info in a hash, tagged in a convenient way.
  3. Create a new file with the info from the hash

I'll have a look at the sample, and see.

_________________
philip


2013-05-01 02:19:55

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Just some quick questions:

1. About the pilcrow. Is this always placed at the beginning of a verse?
2. Checking your sample file I noticed that one footnote (in Ro 10:4) does not have a closing bracket. Is that an error? Can one assume that these footnotes always (should always) come in matching pairs?

_________________
philip


2013-05-01 02:38:28

Joined: 2011-01-12 05:32:38
Posts: 256
Hi, Philip, and thanks for the input. Is the main gist of your first post that it'd be better to use NWP's PowerFind rather than regex?

As for the questions in your second post:
1. Pilcrows will not always be at the beginning of verses.
2. There should always be opening and closing curly brackets for footnotes. That must have been an error.

Thanks for your assistance!
Eric


2013-05-01 03:20:43

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Hello Eric,

Well, PowerFind is regex, but what I meant was that you need to write a "real" macro for this. I am appending a very bare-bones version here. It doesn't address the footnotes or the change of abbreviation names yet (or the pilcrows), but they can be added easily following the same format.

Attachment:
Bible.nwm [18.77 KiB]
Downloaded 242 times

_________________
philip


2013-05-01 03:24:47

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
NisusUser wrote:
1. Pilcrows will not always be at the beginning of verses.


Well the reason I ask, is that in your example you place the pilcrow before the verse marker. So is that the normal rule in such cases? What happens to the ones that are not at the beginning of a verse?

_________________
philip


2013-05-01 04:37:50

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Here are a few explanations of my 'barebones' macro.

The first line of the macro is:
Code:
$doc = Document.active

This creates a document object, since to do just about anything in Nisus macro language you need a document object. Pretty much any macro will start with a line like that.

Code:
Find All '^\$\$ [123]?[A-Z][a-z]{,3} [1-9][0-9]?\:[1-9][0-9]?\n.+', 'Ea-i'

This is a regular PowerFind Pro statement. The important point here is that it doesn't replace anything. It's just a Find All statement. But it is carefully made to match the format of the data. The reason I do this is so we can be sure that the data we are going to read in is in the correct format.

Code:
$verses = $doc.selectedSubstrings

This line saves the matched text from the document in an array. By saving it in an array, we can then process the data one item at a time, using a foreach loop.

So then we start the loop:
Code:
foreach $verse in $verses

This puts each verse of the data one at a time into the text object $verse. With this we can do:

Code:
$verse.find '^\$\$ [123]?[A-Z][a-z]{,3} [1-9][0-9]?\:[1-9][0-9]?\n.+', '$E-i'

This is practically the same thing as the earlier find. There are two important differences: (1) this find is done not on the text of the document but on the text object consisting of a single verse, which is why it uses $verse.find instead of Find All, and (2) the options do not contain "a", so this is not a find all statement. It only does a single find. This is important because of the other option, "$". That is the magic option that allows us to capture parts of the data. To do that we need to mark the information we want, so we change the above to:

Code:
$verse.find '^\$\$ ([123]?[A-Z][a-z]{,3}) ([1-9][0-9]?)\:([1-9][0-9]?)\n(.+)', '$E-i'

This time I have added (…) around the data we want: the book abbreviation, the chapter, the verse, and the verse text. These captured bits can now be used in the following code using the 'names' $1, $2, $3, and $4. But to make the code clearer I use named captures. So instead of using (…), I add a name for each capture, e.g., (?<abbr>…) for the book abbreviation. Now I can refer to that using $abbr. So the whole find statement looks like this:
Code:
$verse.find '^\$\$ (?<abbr>[123]?[A-Z][a-z]{,3}) (?<c>[1-9][0-9]?)\:(?<v>[1-9][0-9]?)\n(?<txt>.+)', '$E-i'


The rest of the code should hopefully be more or less self-explanatory. Basically the captured pieces are reassembled in the desired format, and then compiled into a big text object, with which we can make a new file.
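
For comparison, here is the same capture idea as a small Python sketch. It is only an illustration, not the macro itself: Python spells named groups `(?P<name>…)` rather than `(?<name>…)`, and the variable names and sample verse are assumptions chosen to mirror the explanation above.

```python
import re

# Same shape as the PowerFind expression, with Python's named-group syntax.
VERSE_RE = re.compile(
    r"^\$\$ (?P<abbr>[123]?[A-Z][a-z]{0,3}) "
    r"(?P<c>[1-9][0-9]?):(?P<v>[1-9][0-9]?)\n(?P<txt>.+)"
)

# One verse record, as it would arrive inside the foreach loop.
verse = "$$ Ge 1:3\nAnd GOD said, Let there be light: and there was light."

m = VERSE_RE.match(verse)
if m is None:
    raise ValueError("verse record does not have the expected format")

# Pull the captured pieces out by name and reassemble them.
abbr, c, v, txt = m.group("abbr", "c", "v", "txt")
line = f"\\v {v} {txt}"
print(abbr, c, v)
print(line)
```

Note that `[1-9][0-9]?` accepts at most two digits, so a pattern like this would need widening for books with three-digit chapter or verse numbers.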

_________________
philip


2013-05-01 06:06:24

Joined: 2011-01-12 05:32:38
Posts: 256
Quote:
Well the reason I ask, is that in your example you place the pilcrow before the verse marker. So is that the normal rule in such cases? What happens to the ones that are not at the beginning of a verse?


About pilcrows. They can be anywhere in the SS verses. In USFM the \p can be mid-verse. Is that what you meant? For USFM, the \p really just says that prose begins here. Technically it has other markers for, say, poetry (\q), etc. For our purposes here, we'll just use \p.

In my examples (SS), the pilcrows are at the beginning of the verse text, but after the verse marker (e.g., $$ Ge 1:4).


2013-05-01 06:32:19

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Ok, so now here is an extended version of the previous macro. This adds a few things:
  • It adds the change in abbreviations. In the macro this is done with a hash. This currently only covers the case of Ro -> ROM, but can easily be expanded. Ideally this expansion would also be done with a macro. Ask if you have any questions.
  • It adds the footnote conversion, including adding the chapter:verse as you requested. Once you see how this is done, you should easily be able to add other changes, such as the deity names.
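
As a rough illustration of these two additions (in Python, not the macro language, and not taken from the attachment), a lookup table can map the abbreviations and a substitution can rewrite each {…} footnote with its reference. The names `USFM_ID` and `convert_footnotes` are hypothetical, and only the two abbreviation pairs mentioned in this thread are filled in.

```python
import re

# Abbreviation table (hash). Only the pairs mentioned in this thread are
# listed; a full table would be loaded from the attached .csv file.
USFM_ID = {"Ge": "GEN", "Ro": "ROM"}

# A footnote is everything between one pair of curly brackets.
FOOTNOTE_RE = re.compile(r"\{([^{}]*)\}")


def convert_footnotes(text, chapter, verse):
    """Rewrite {note} as \\f + \\fr C:V \\ft note\\f* for the given verse."""
    def repl(match):
        note = match.group(1)
        return f"\\f + \\fr {chapter}:{verse} \\ft {note}\\f*"
    return FOOTNOTE_RE.sub(repl, text)


sample = ("and GOD divided {the light from...: Heb. between the light "
          "and between the darkness}the light from the darkness.")
print(f"\\id {USFM_ID['Ge']}")
print(convert_footnotes(sample, 1, 4))
```

The chapter and verse are passed in from the verse record being processed, which is why the read-then-write structure makes the footnote reference easy to add.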

I have tried to comment everything, so you should be able to adjust things as necessary. If it's unclear, let me know.

Attachment:
Bible v2.nwm [19.42 KiB]
Downloaded 229 times


Best



_________________
philip
2013-05-01 06:41:28

Joined: 2011-01-12 05:32:38
Posts: 256
I'm studying your notes when I can. :)

The "barebones" macro you made gives this:

Code:
\id Ge
\c 1
\v 1
In the beginning God created the heaven and the earth.
\id Ge
\c 1
\v 2
And the earth was without form, and void; and darkness [was] upon the face of the deep. And the Spirit of God moved upon the face of the waters.
\id Ge
\c 1
\v 3
And God said, Let there be light: and there was light.
\id Ge
\c 1
\v 4
And God saw the light, that [it was] good: and God divided {the light from...: Heb. between the light and between the darkness}the light from the darkness.


Based on the intent of changing only references, it actually should be this:
Code:
\id Ge
\c 1
\v 1 In the beginning God created the heaven and the earth.
\v 2 And the earth was without form, and void; and darkness [was] upon the face of the deep. And the Spirit of God moved upon the face of the waters.
\v 3 And God said, Let there be light: and there was light.
\v 4 And God saw the light, that [it was] good: and God divided {the light from...: Heb. between the light and between the darkness}the light from the darkness.


In other words, I wasn't clear in my initial explanations:
1. the \id marker only appears once per book, i.e. once per file.
2. the \c marker only occurs when a new chapter is starting.
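
These two rules amount to remembering the last chapter seen while writing the output. A minimal Python sketch of that logic, with a hypothetical `to_usfm` function and made-up sample verses (it is not the macro from this thread):

```python
def to_usfm(verses, book_id):
    """Emit \\id once, and \\c only when the chapter number changes.

    `verses` is a list of (chapter, verse, text) tuples in reading order.
    """
    lines = [f"\\id {book_id}"]      # rule 1: once per book/file
    current_chapter = None
    for chapter, verse, text in verses:
        if chapter != current_chapter:
            lines.append(f"\\c {chapter}")   # rule 2: only on a new chapter
            current_chapter = chapter
        lines.append(f"\\v {verse} {text}")
    return "\n".join(lines)


sample = [
    (1, 1, "First verse."),
    (1, 2, "Second verse."),
    (2, 1, "New chapter."),
]
print(to_usfm(sample, "GEN"))
```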

Oh, I see you've put up a newer version now. Haven't tested it yet. I was still working through how you made the 1st one :)

Thanks!


2013-05-01 10:53:15

Joined: 2011-01-12 05:32:38
Posts: 256
I just ran v. 2, and it seems to do the footnotes right including putting in the extra reference I wanted (way to go!).

Also, I'm not sure the pilcrow is being handled yet (the \p marker).

Of course, it still has the issue of the repetitive markers I mentioned in the previous post.

And since only Ro > ROM is done, and I used Genesis, the \id didn't work. That is no problem at all.

I've got to sign off again for a while, maybe a day. But I'll check back in later. Thanks so much. This is really nice!


2013-05-01 10:57:10

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
NisusUser wrote:
The "barebones" macro you made gives this:

Code:
\id Ge
\c 1
\v 1
In the beginning God created the heaven and the earth.


Based on the intent of changing only references, it actually should be this:
Code:
\id Ge
\c 1
\v 1 In the beginning God created the heaven and the earth.
\v 2 And the earth was without form, and void; and darkness [was] upon the face of the deep. And the Spirit of God moved upon the face of the waters.



I was so caught up in my "didactic moment" I totally overlooked this "detail". Sorry!
But this is an easy problem to fix. Here is version 3 which should fix this and also do pilcrows and work for other books. (Knock on wood.)

Attachment:
Bible v3.nwm [20.48 KiB]
Downloaded 224 times

_________________
philip


2013-05-01 17:44:25

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
And now version 4 also handles deity names (converting them from all caps to capitalized).

Attachment:
Bible v4.nwm [20.78 KiB]
Downloaded 265 times

_________________
philip


2013-05-01 19:48:07

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
While the above should do most of what you wanted, there is still a small problem with the pilcrow. Right now the following:
Code:
$$ Ge 1:1
¶ In the beginning GOD created the heaven and the earth.


Will be turned into:
Code:
\id GEN
\c 1
\v 1
\p
In the beginning \nd God\nd* created the heaven and the earth.


But I believe you wanted:

Code:
\id GEN
\c 1
\p
\v 1 In the beginning \nd God\nd* created the heaven and the earth.


In other words, when you have a ¶ at the beginning of a verse, you seem to want the \p before the verse marker.
One way to handle this would be to fix it after the fact with Find and Replace. Handling it in the macro might be possible. But one thing I still don't understand is what happens when the ¶ is in the middle of a verse. Do you break the verse into two lines? Does the \p still go on its own line, or does it sit in the middle of the verse line? One would have to know this to fix the problem.

But for what it's worth, here is my interpretation:
Attachment:
Bible v5.nwm [22.46 KiB]
Downloaded 248 times

(file updated)

Otherwise I hope the macro works for you.

Cheers

_________________
philip


Last edited by phspaelti on 2013-05-02 03:55:30, edited 1 time in total.



2013-05-01 20:15:00

Joined: 2011-01-12 05:32:38
Posts: 256
Quote:
But I believe you wanted:

Code:
\id GEN
\c 1
\p
\v 1 In the beginning \nd God\nd* created the heaven and the earth.


In other words it seems when you have a ¶ at the beginning of a verse you seem to want it before the verse marker.
One way to handle this would be to fix this after the fact with Find and Replace. Handling it in the macro might be possible. But one thing I still don't understand is what happens when the ¶ is in the middle of a verse. Do you break the verse into two lines? Does the \p still go on its own line, or does it sit in the middle of the verse line? One would have to know this to fix the problem.


That is correct: the \c marker must come before the \p one. Otherwise the formatting assigned to \c gets applied to the prose.

The marker \p can be mid-verse – not necessarily at the beginning of the verse or at the beginning of the line. If it's mid-verse, it does not start on a new line. Then there is no new \v marker until the next verse.

Example (assuming the first sentence's 2nd phrase was supposed to start with a new paragraph):

Code:
\id GEN
\c 1
\p
\v 1 In the beginning God created the heaven and the earth.
\v 2 And the earth was without form, and void; \p and darkness [was] upon the face of the deep. And the Spirit of God moved upon the face of the waters.
\v 3 And God said, Let there be light: and there was light.
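
The rule described here (a leading pilcrow becomes a \p line before the verse marker, a mid-verse pilcrow becomes an inline \p) could be sketched like this in Python. The function name `format_verse` is an assumption for illustration, and this is not how the attached macro does it.

```python
import re

PILCROW = "\u00b6"  # the paragraph sign used by SwordSearcher


def format_verse(verse, text):
    """Return the USFM lines for one verse, placing \\p markers."""
    lines = []
    if text.startswith(PILCROW):
        # Pilcrow at the start of the verse: \p goes on its own line
        # BEFORE the \v marker.
        lines.append("\\p")
        text = text[len(PILCROW):].lstrip()
    # Any remaining pilcrow is mid-verse: keep it inline as "\p ".
    text = re.sub(PILCROW + r"\s*", lambda m: "\\p ", text)
    lines.append(f"\\v {verse} {text}")
    return lines


print("\n".join(format_verse(1, "\u00b6 In the beginning.")))
print("\n".join(format_verse(2, "And void; \u00b6 and darkness.")))
```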


Re: \nd (names of deity). This is going to take some tweaking. That formatting should only be applied (best I can tell) if it is one isolated word in all caps. That's not foolproof, though, so I'll have to think this through. That's because in SS there is other text in all caps also – citations from the OT presented in the NT.
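
One possible way to express the "isolated all-caps word" test, sketched in Python under stated assumptions: the word list in `DEITY_WORDS` is a guess (only GOD appears in the samples in this thread), and a run of two or more consecutive all-caps words is treated as a quotation and left alone. As noted above, a heuristic like this is not foolproof.

```python
import re

# Assumed word list; only GOD is confirmed by the samples in this thread.
DEITY_WORDS = {"GOD", "LORD"}

# A maximal run of all-caps words (2+ letters each), separated by
# spaces or light punctuation.
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:[ ,;:]+[A-Z]{2,})*\b")


def mark_deity(text):
    """Tag a deity name only when it stands alone, not inside a caps run."""
    def repl(match):
        run = match.group(0)
        if run in DEITY_WORDS:          # a single isolated word
            return f"\\nd {run.capitalize()}\\nd*"
        return run                      # multi-word run: leave untouched
    return CAPS_RUN.sub(repl, text)


print(mark_deity("And GOD said, Let there be light"))
print(mark_deity("as it is written, THE JUST SHALL LIVE BY FAITH."))
```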


2013-05-02 00:50:18