Create Index for Japanese Documents 

Joined: 2008-05-17 04:02:32
Posts: 400
Here is a macro which [1] indexes Japanese words in the frontmost document as “yomi<space>word”, [2] inserts the Index, and [3] removes the “yomi”, which is needed only for sorting the Index entries.
Code:
### Generate Index ###

# This is a sample NWP macro that generates an Index for Japanese documents.
# Run against “kusabana.rtf”, it generates the Index at the end of the file.
# For other documents, define as many $item entries as needed.
# The macro indexes each word as “yomi<space>word”, where the yomi serves as the sort key.
# After the Index is generated, the “yomi” and its trailing space are removed.

$numColumn = 1 # Number of columns
$removeYomi = @true

$doc = Document.active
if ! $doc
   exit 'No open document, exiting...'
end

Debug.setDestination 'new'

$item = Hash.new
   # $item{'よみ 索引対象語'} = Cast to String 'Find Expression'
$item{'がんぴ 眼皮'} = Cast to String '眼皮'
$item{'せきちく 石竹'} = Cast to String '石竹'
$item{'だるまだいし 達磨大師'} = Cast to String '達磨大師'
$item{'わかんさんさいずえ 和漢三才図会'} = Cast to String '和漢三才図会'
$item{'かあねいしょん カーネイション'} = Cast to String 'カーネイション'
$item{'ちょうじ 丁子'} = Cast to String '丁子'
$item{'なでしこ ナデシコ'} = Cast to String '(?<!\p{Katakana})ナデシコ'
   # 'ナデシコ' not preceded by any Katakana letter (\p{Katakana})
$item{'なでしこ ナデシコ:からなでしこ カラナデシコ'} = Cast to String 'カラナデシコ'
$item{'なでしこ ナデシコ:こなでしこ コナデシコ'} = Cast to String 'コナデシコ'
$item{'なでしこ ナデシコ:むしとりなでしこ ムシトリナデシコ'} = Cast to String 'ムシトリナデシコ'
$item{'ふじわらのさだいえ 藤原定家'} = Cast to String '定家'
$item{'ふじわらのさだいえ 藤原定家:しんちょくせんしゅう 『新勅撰集』'} = Cast to String '新勅撰集'
$item{'ぱあすれい パースレイ'} = Cast to String 'パースレイ'
$item{'ありすとてれす アリストテレス'} = Cast to String 'アリストテレス'
$item{'だりや ダ(ー)リヤ'} = Cast to String 'ダリヤ|ダーリヤ'
   # 'ダリヤ' or 'ダーリヤ'
$item{'かゔあにゅす カヴアニュス'} = Cast to String 'カヴアニュス'
$item{'かゔあにゅす カヴアニュス:すぺいんしょくぶつずせつ 西班牙植物図説'} = Cast to String '西班牙植物図説'
$item{'はやしじゅっさい 林述斎'} = Cast to String '林述斎'
$item{'わかつきれいじろう 若槻禮次郞'} = Cast to String '若槻首相'
$item{'きんげんぞう 金源三'} = Cast to String '金源三' # ??? I don’t know
$item{'みなもとのとしより 源俊頼'} = Cast to String '俊頼'
$item{'にほん 日本'} = Cast to String '(?<!\p{Han})日本(?!\p{Han})'
   # '日本' neither preceded nor followed by any Kanji letter (\p{Han})
$item{'にほん 日本:━じん ━人'} = Cast to String '(?<!\p{Han})日本人(?!\p{Han})'
   # '日本人' neither preceded nor followed by any Kanji letter (\p{Han})
$item{'にほん 日本:━および━ 「━及━」'} = Cast to String '「日本及日本人」'
$item{''} = Cast to String ''   # An empty $item seems to have no effect
$item{''} = Cast to String ''


$errors = Array.new

foreach $i in $item.keys
   $sels = $doc.text.findAll $item{$i}, 'E-i', '-am'
   if $sels.count
      Push Target Selection $sels
         Add to Text Index As $i
      Pop Target Selection
   else
      $errors.appendValue $item{$i}
   end
end

# Define cross-references
$crossRef = Hash.new
$crossRef{'ていか 定家'} = Cast to String '藤原定家'
$crossRef{'としより 俊頼'} = Cast to String '源俊頼'

foreach $i in $crossRef.keys
   $sel = $doc.text.find $crossRef{$i}, 'E-ir'
   $crossRefString = $crossRef{$i}
   if $sel
      Push Target Selection $sel
         Add to Text Index As $i, $crossRefString
      Pop Target Selection
   else
      $errors.appendValue $crossRef{$i}
   end
end

# Exclude words in the “Not for Index” character style from the Index
$style = $doc.styleWithName 'Not for Index'
$exclude = $doc.text.findAll $style
if $exclude.count
   Push Target Selection $exclude
      Remove Text Indexing
   Pop Target Selection
end

Select Document End
Document.setActive $doc
Menu.activateAtPath(':Tools:Index:Insert Index')
Send Text $numColumn 
Press Button 'Insert'

if $removeYomi
   Document.setActive $doc
   Replace All '^\S+ ', '', 'EsS-i'
   # Remove visible characters at each paragraph start together with trailing space
end

if $errors.count
   $errmsg = Cast to String 'Failed: '
   foreach $err in $errors
      $err = $errmsg & $err
      Debug.log $err
   end
end

### end of macro
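
The sort-key trick used by the macro (index as “yomi<space>word”, sort, then strip the yomi) can be illustrated outside NWP. The following is only a minimal Python sketch, not part of the macro: the entry list is a hypothetical sample, and plain code-point sorting stands in for the Index's real Japanese collation.

```python
import re

# Hypothetical sample entries in the "yomi<space>word" form used by the macro.
entries = ["なでしこ ナデシコ", "がんぴ 眼皮", "せきちく 石竹", "にほん 日本"]


def strip_yomi(entry: str) -> str:
    """Remove the leading run of non-space characters and the space after it.

    This mirrors the macro's Replace All pattern '^\\S+ '.
    """
    return re.sub(r"^\S+ ", "", entry)


# Because the yomi comes first, sorting the raw strings orders them by yomi.
# Assumption: code-point order is used here as a rough stand-in for collation.
sorted_entries = sorted(entries)

# After sorting, the yomi is no longer needed and can be dropped.
index_lines = [strip_yomi(entry) for entry in sorted_entries]
print(index_lines)
```

The point of the design is that Kanji cannot be sorted by reading directly, so the reading is carried along as a temporary prefix and discarded once the order is fixed.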

With a standard contemporary Japanese document, perhaps it would be safer to enclose all Katakana words between (?<!\p{Katakana}|\x{30FC}) and (?!\p{Katakana}|\x{30FC}) (\x{30FC} stands for “ー”), and all Kanji words between (?<!\p{Han}) and (?!\p{Han}) so that, for example, “シリア” will not match “シリアノス” nor “アッシリア”. Of course, that depends on how the document is written.
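
To see what those lookarounds do, here is a small Python sketch (an illustration only; NWP itself uses the \p{...} classes shown above). Python's standard `re` module has no \p{Katakana} or \p{Han}, so the sketch assumes explicit code-point ranges as approximations: U+30A0 to U+30FF for Katakana (which already contains “ー”, U+30FC) and U+4E00 to U+9FFF for Kanji. The helper name `bounded` is hypothetical.

```python
import re

# Assumed approximations of \p{Katakana} and \p{Han} for the stdlib re module.
KATAKANA = "\u30A0-\u30FF"  # Katakana block, includes the prolonged sound mark ー
HAN = "\u4E00-\u9FFF"       # basic CJK Unified Ideographs block only


def bounded(word: str, char_class: str) -> re.Pattern:
    """Match `word` only when it is not adjacent to a character of `char_class`."""
    return re.compile(
        f"(?<![{char_class}]){re.escape(word)}(?![{char_class}])"
    )


syria = bounded("シリア", KATAKANA)
katakana_hits = [m.start() for m in syria.finditer("シリアとアッシリア、そしてシリアノス")]

japan = bounded("日本", HAN)
kanji_hits = [m.start() for m in japan.finditer("日本は日本人と日本語")]

print(katakana_hits, kanji_hits)
```

In both sample strings only the first, free-standing occurrence is matched; the occurrences inside “アッシリア”, “シリアノス”, “日本人” and “日本語” are rejected by the lookbehind or lookahead.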
Attachment:
File comment: Macro and a sample file “kusabana.rtf”
Macro&SampleFile.zip [21.34 KiB]
Downloaded 54 times


Last edited by Kino on 2017-03-31 11:54:38, edited 3 times in total.



2017-03-31 10:22:42

Joined: 2008-05-17 04:02:32
Posts: 400
It is tedious to type the yomi (the pronunciation of a Kanji word, written in Hiragana or Katakana letters) for each Index entry. Here is a macro which inserts “yomi<space>” before the selected Japanese word(s).
Code:
### Prepend Yomi ###

# This macro adds yomi before selected Japanese word(s) in Kanji, Katakana, or Hiragana.
# This macro requires mecab <http://taku910.github.io/mecab/> installed manually or via Homebrew, MacPorts, etc.
# Dictionary used by mecab (usually “ipadic”) should be encoded in UTF-8.
# If you use Homebrew, the following Terminal commands will do the job.
#    brew install mecab
#    brew install mecab-ipadic
# If you have mecab installed in a directory other than /usr/local/bin, modify the value of $mecabPath accordingly.
# Yomi returned by mecab is not always correct. Add as many such words to $defYomi as needed.
# This macro requires a Services module 'Transform Clipboard by Shell Script.workflow' in /Users/<you>/Library/Services as follows:
#   on run
#      set scriptvar to (the clipboard)
#      set the clipboard to (do shell script scriptvar)
#   end run
# which executes the shell script in the clipboard and writes its output back to the clipboard.


$mecabPath = '/usr/local/bin/mecab'
$betweenYomi = '*'
$afterYomi = Cast to String " "
$VowelizeProlongedSoundMark = @true

$defYomi = Hash.new   # Add words mecab does not understand well
$defYomi{'形相'} = Cast to String 'けいそう'   # mecab returns 'ぎょうそう'
$defYomi{'本性'} = Cast to String 'ほんせい'   # mecab returns 'ほんしょう'
$defYomi{'離存'} = Cast to String 'りそん'   # mecab does not know the word
$defYomi{'冒瀆'} = Cast to String 'ぼうとく'   # mecab does not understand non-JIS X 0208 chars
$defYomi{'碧巌録'} = Cast to String 'へきがんろく'   # mecab does not know the word
$defYomi{'西班牙'} = Cast to String 'すぺいん'   # mecab does not know the word

$doc = Document.active
if ! $doc
   exit 'No open document, exiting...'
end

$sels = $doc.textSelections
if ! $sels.firstValue.length
   exit 'No selection, exiting...'
end

$str = $doc.selectedSubstrings.join($betweenYomi)
$script = "echo '$str' | $mecabPath -O yomi | perl -Mutf8 -CS -pe 'tr/[ァ-ヶヽヾ]/[ぁ-ゖゝゞ]/'"
Write Clipboard $script

Menu.activateAtPath('Services:Transform Clipboard by Shell Script')

$converted = Read Clipboard

while $converted == $script
   Sleep 0.2
   $converted = Read Clipboard
end

if $VowelizeProlongedSoundMark   # Necessary for correct sort order
   $converted.replaceAll '(?<=[あぁかゕがさざただなはばぱまやゃらわゎ])ー', 'あ', 'E'
   $converted.replaceAll '(?<=[いぃきぎしじちぢにひびぴみりゐ])ー', 'い', 'E'
   $converted.replaceAll '(?<=[うぅゔくぐすずつづぬふぶぷむゆゅる])ー', 'う', 'E'
   $converted.replaceAll '(?<=[えぇけゖげせぜてでねへべぺめえれゑ])ー', 'え', 'E'
   $converted.replaceAll '(?<=[おぉこごそぞとどのほぼぽもよょろを])ー', 'お', 'E'
end

$converted = Cast to Attributed String $converted
$ja = Language.languageWithCode('ja')
$fontName = $sels[0].text.displayAttributesAtIndex($sels[0].location).fontName

Push Target Text $converted
   $ja.apply      # Perhaps necessary for correct sort order
   Set Font Name $fontName
Pop Target Text

$converted = $converted.split($betweenYomi)

foreach $sel in reversed $sels
   $yomi = $converted.pop & $afterYomi
   $sel.text.insertAtIndex $sel.location, $yomi
end

### end of macro

The macro relies on “mecab” (Unix program) and “Transform Clipboard by Shell Script.workflow” (Services module). The latter is a slightly modified version of Philip Spaelti’s “Run Shell Script from Clipboard” https://nisus.com/forum/viewtopic.php?f=17&t=5948#p27249.
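
The two text transformations the macro applies to mecab's output (Katakana to Hiragana, then replacing “ー” by the vowel of the preceding kana so that entries sort correctly) can be sketched in Python. This is only an illustration under assumptions: it does not call mecab at all, the function names are hypothetical, and the kana rows are copied from the macro's replaceAll expressions.

```python
import re

# Katakana ァ..ヶ (U+30A1..U+30F6) map to Hiragana ぁ..ゖ by subtracting 0x60;
# the iteration marks ヽ and ヾ map to ゝ and ゞ the same way.
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}
KATA_TO_HIRA.update({0x30FD: 0x309D, 0x30FE: 0x309E})

# For each vowel, the kana after which ー should be read as that vowel
# (rows taken from the macro's character classes).
VOWEL_ROWS = {
    "あ": "あぁかゕがさざただなはばぱまやゃらわゎ",
    "い": "いぃきぎしじちぢにひびぴみりゐ",
    "う": "うぅゔくぐすずつづぬふぶぷむゆゅる",
    "え": "えぇけゖげせぜてでねへべぺめれゑ",
    "お": "おぉこごそぞとどのほぼぽもよょろを",
}


def to_hiragana(text: str) -> str:
    """Convert Katakana letters to Hiragana; ー (U+30FC) is left untouched."""
    return text.translate(KATA_TO_HIRA)


def vowelize(text: str) -> str:
    """Replace each ー with the vowel of the Hiragana that precedes it."""
    for vowel, row in VOWEL_ROWS.items():
        text = re.sub(f"(?<=[{row}])ー", vowel, text)
    return text


yomi = vowelize(to_hiragana("カーネイション"))
print(yomi)
```

For the two Katakana entries of the first macro, this reproduces the hand-typed sort keys “かあねいしょん” and “ぱあすれい”.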

This macro is terribly slow compared with my earlier macro, which called mecab via perl and was killed by the Sandbox ;-( So running it with a single word selected is very irritating. It is better to run it on multiple non-contiguous selections.

FEATURE REQUEST! A menu command which converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word written in Hiragana or Katakana letters), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API which makes those features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has had such a menu command, “:Tools:Kanji to Hiragana”, for a long time.
Attachment:
Macro&ServicesModule.zip [36.36 KiB]
Downloaded 52 times


Last edited by Kino on 2017-03-31 11:45:31, edited 2 times in total.



2017-03-31 10:59:08

Joined: 2008-05-17 04:02:32
Posts: 400
And here is a macro that creates a Japanese word list from the frontmost document, which you might find useful.
Code:
### Japanese Word List ###

# Create a list of Katakana or Kanji words from the frontmost document.

$lang = Language.languageWithCode 'ja'
# Use zh_TW for Traditional Chinese and zh_CN for Simplified Chinese

Debug.setCodeProfilingEnabled false

$LF = Cast to String "\n"
$tab = Cast to String "\t"
$doc = Document.active
if $doc == undefined
   exit 'No open document, exiting...'
end

$findExp = Hash.new  # a, b, c,... are used as sort keys

$findExp{'a. Katakana words including ジャン゠ソウル・パルトル, F・カフカ, Ch. ミュンシュ'} = '(?:[\x{FF21}-\x{FF3A}](?<p>[\x{30A0}\x{30FB}\x{FF1D}])|\p{Upper}\p{Lower}?\.(?:[-\x20]\p{Upper}\p{Lower}?\.)*\x20)*(?<k>\p{Katakana}[\p{Katakana}\x{30FC}]*)(?:(?:[\x{30A0}\x{30FB}\x{FF1D}][\x{FF21}-\x{FF3A}])?\g<p>\g<k>)*'
$findExp{'b. Katakana/Kanji words such as 老子, 新プラトン主義, フィリップ・K・ディック'} = '(?:[\x{FF21}-\x{FF3A}](?<p>[\x{30A0}\x{30FB}\x{FF1D}])|\p{Upper}\p{Lower}?\.(?:[-\x20]\p{Upper}\p{Lower}?\.)*\x20)*(?:(?<k>\p{Katakana}[\p{Katakana}\x{30FC}]*)(?:(?:[\x{30A0}\x{30FB}\x{FF1D}][\x{FF21}-\x{FF3A}])?\g<p>\g<k>)*|\p{Han}+(?:\g<p>\p{Han}+)*)+'
$findExp{'c. Separate Kanji chars'} = '\p{Han}'
$findExp{'d. One or more consecutive Kanji chars'} = '\p{Han}+'
$findExp{'e. Two or more consecutive Kanji chars'} = '\p{Han}{2,}'
$findExp{'f. Kanji followed by 1 or 2 Hiragana'} = '\p{Han}+\p{Hiragana}{1,2}'

$menuItems = $findExp.keys
$menuItems.sort
$input = Prompt Options 'List up . . .', '', '', $menuItems
$selections = $doc.selectedSubstrings

if $selections.firstValue.length
   $LF = Cast to String "\n"
   $selections = $selections.join $LF
   $sels = $selections.findAll $findExp{$input}, 'E-i'
else
   $sels = $doc.text.findAll $findExp{$input}, 'E-i'
end

if ! $sels.count
   exit 'No word found, exiting...'
end

$str = Hash.new
foreach $sel in $sels
   $str{$sel.substring} += 1
end

$words = $str.keys
$words.sort 'li', $lang
foreach $i, $word in $words
   $ocr = $str{$word} & $tab
   $words[$i] = $ocr & $word
end
$words = $words.join $LF
Push Target Text $words
   $lang.apply
Pop Target Text
Document.newWithText $words

### end of macro ###
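
The core of the macro (find every match, count the occurrences of each distinct word, sort, and emit “count<tab>word” lines) can be sketched in Python. This is a rough illustration, not a port: the range U+4E00 to U+9FFF is an assumed stand-in for \p{Han}, only option “e” (two or more consecutive Kanji) is reproduced, the sample sentence is made up, and code-point sorting replaces the macro's language-aware sort.

```python
import re
from collections import Counter

# Assumed stand-in for the macro's '\p{Han}{2,}' expression (option "e").
KANJI_RUN = re.compile(r"[\u4E00-\u9FFF]{2,}")


def word_list(text: str) -> list[str]:
    """Return 'count<TAB>word' lines for every distinct Kanji run in `text`."""
    counts = Counter(KANJI_RUN.findall(text))
    # Sorted by code point here; the macro sorts with the 'ja' language instead.
    return [f"{count}\t{word}" for word, count in sorted(counts.items())]


# Hypothetical sample text.
sample = "石竹は日本の花で、石竹と眼皮は別の花である。"
lines = word_list(sample)
print("\n".join(lines))
```

Single Kanji such as “花” and “別” are skipped because the expression requires at least two consecutive Kanji.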

Attachment:
File comment: Macro file
JapaneseWordList.nwm.zip [5.43 KiB]
Downloaded 52 times


Last edited by Kino on 2017-03-31 11:45:54, edited 1 time in total.



2017-03-31 11:10:27

Joined: 2008-05-17 04:02:32
Posts: 400
With those macros (not exactly the same ones, because special formatting had to be applied), I created the 索引 (index) for every volume of『井筒俊彦全集』(The Complete Works of Toshihiko IZUTSU, 12 vols and 1 supplement) http://www.keio-up.co.jp/kup/izutsu/cw.html https://www.keio-up.co.jp/np/search_result.do?ser_id=73 except the 総索引 (general index of all the volumes, included in the last volume, 別巻). This would have been impossible had I not been an NWP user. A thousand thanks, Mr. Nisus and Nice Us people! :-)


2017-03-31 11:37:46

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Kino wrote:
FEATURE REQUEST! A menu command which converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word written in Hiragana or Katakana letters), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API which makes those features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has had such a menu command, “:Tools:Kanji to Hiragana”, for a long time.

I second that feature request

_________________
philip


2017-03-31 19:40:31

Joined: 2007-03-28 07:30:34
Posts: 139
I third the feature request.


2017-04-01 08:34:15

Joined: 2007-01-17 05:46:17
Posts: 145
Location: Tokyo, Japan
This is an essential feature. Please add it to NW!

_________________
Best regards,

Nobumi Iyanaga
Tokyo,
Japan


2017-04-01 21:15:38

Joined: 2007-03-31 14:59:22
Posts: 333
Location: Frankfurt, Germany
Hi Kino –

I love to read Toshihiko IZUTSU – but sorry English only! This Teheran Connection continues to work with Sachiko MURATA – great reading too!

HE

Kino wrote:
With those macros (not exactly the same ones, because special formatting had to be applied), I created the 索引 (index) for every volume of『井筒俊彦全集』(The Complete Works of Toshihiko IZUTSU, 12 vols and 1 supplement) http://www.keio-up.co.jp/kup/izutsu/cw.html https://www.keio-up.co.jp/np/search_result.do?ser_id=73 except the 総索引 (general index of all the volumes, included in the last volume, 別巻). This would have been impossible had I not been an NWP user. A thousand thanks, Mr. Nisus and Nice Us people! :-)

_________________
MacBook Pro i5
SSD 840/850 Pro
macOS Sierra 10.12.6
Nisus Writer Pro 2.1.8


2017-04-12 00:08:31
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
phspaelti wrote:
Kino wrote:
FEATURE REQUEST! A menu command which converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word written in Hiragana or Katakana letters), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API which makes those features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has had such a menu command, “:Tools:Kanji to Hiragana”, for a long time.

I second that feature request

Thank you to all of you for letting us know this was important to you. I've added the potential enhancement to our issue tracker.


2017-05-11 16:48:36