how to find a word with certain letter, and highlight all 

Joined: 2008-01-16 11:09:49
Posts: 13
Location: Central Asia
Hoping some magic might happen here.

I need to be able to find all words in a document which have one or more of 4 particular letters, but none of 3 other letters, and then highlight/select what is found so I can copy it all to paste into another document.

More detail:

In Cyrillic Kazak, I need to find all words with any of the following vowels, ә, ө, ү, і, (04D8/04D9, 04E8/04E9, 04AE/04AF, 0406/0456, these are Upper/lower cases) but NOT with any of the following consonants, г,к, (0413/0433, 041A/043A) nor е (0415/0435)

For some examples, I need to find words like үй, ән, бір, түсу, бір-бірі, but not words like күй, үйім, бірге, түсемін, әңгіме.

This is because I am moving Kazak Cyrillic files to Kazak Arabic. In Kazak Arabic, a г, к, or е indicates that the vowels in the word are "soft"; if there is no г, к, or е, a hamza is needed to tell the reader that a given vowel is soft (ә sound, not а sound; ө sound, not о sound; ү sound, not ұ sound; і sound, not ы sound). If I can easily find all such Cyrillic words in the original files, I can copy them into a master list and use it as a reference, or turn it into a big macro list.



2008-12-14 07:00:07
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
This is essentially a PowerFind problem. Let's start with just trying to find all of the words with the desired vowels. The PowerFind expression for that would be:
Code:
\b\w*[әөүі]\w*\b

The "\b" is the special marker for word boundary, and "\w*" means zero or more "word characters". And "[әөүі]" is the character set containing just the vowels you want. It looks like you also want to include hyphenated words, so you'd want to combine that with "\w" like so:
Code:
\b[\-\w]*[әөүі][\-\w]*\b


The next step is to turn the general "\w" into an expression that rejects the letters you want to exclude from the match. We can do that using the "&&" character set operator:
Code:
[[\-\w]&&[^гке]]

That expression means any word character or hyphen, but only if that character isn't one of "гке".

So, putting it all together, you have:
Code:
\b[[\-\w]&&[^гке]]*[әөүі][[\-\w]&&[^гке]]*\b

Be sure to turn off the "case insensitive" option, or the find won't do what you expect/want.
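
For anyone who wants to try the logic outside Nisus, here is a minimal Python sketch of the same expression. It is not PowerFind: Python's `re` module has no "&&" set operator, so the sketch assumes a negative lookahead, `(?![гке])[\w-]`, as a stand-in for "[[\-\w]&&[^гке]]". The function name `find_soft_words` is made up for this illustration. Python's `re` is case sensitive by default, which corresponds to having the "case insensitive" option turned off.

```python
import re

# Stand-in for PowerFind's [[\-\w]&&[^гке]]: a word character or hyphen
# that is not one of the excluded letters (assumption: emulated with a lookahead).
ALLOWED = r"(?:(?![гке])[\w-])"

# A whole word built from allowed characters, with at least one of the four vowels.
SOFT_WORD = re.compile(rf"\b{ALLOWED}*[әөүі]{ALLOWED}*\b")


def find_soft_words(text):
    """Return every word that has ә, ө, ү or і but no г, к or е."""
    return SOFT_WORD.findall(text)


if __name__ == "__main__":
    sample = "үй ән бір түсу бір-бірі күй бірге түсемін әңгіме"
    print(find_soft_words(sample))
```

The sample string uses the example words from the first post, so the wanted and unwanted words can be checked by eye.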

Now that's out of the way, if you want to turn this into a macro that copies all the words into a new file:
Code:
# gather all words
Find All '\b[[\-\w]&&[^гке]]*[әөүі][[\-\w]&&[^гке]]*\b', 'E-i'
$doc = Document.active
$sels = $doc.textSelections

# create new document with all words
New
ForEach $sel in $sels
   $word = $sel.subtext
   Type Text $word
   Type Text "\n"
End

# sort the words
Select All
Menu ':Edit:Sort Paragraphs:Ascending (A-Z)'


2008-12-15 14:49:01

Joined: 2008-01-16 11:09:49
Posts: 13
Location: Central Asia
Wow. You did it. Now I am going to go over the incredibly helpful explanations you gave so that I can understand it too! Huge thanks to you, thank you!



2008-12-25 20:32:24

Joined: 2008-05-17 04:02:32
Posts: 400
Martin’s find expression may not fit what Scooke wants to do, which, in my understanding, is to find үй, ән, ҮЙ, ӘН, etc. excluding күй, бірге, КҮЙ, БІРГЕ, etc. Also, it seems that [[\-\w]&&[^гке]] does not exclude ГКЕ even in the case insensitive mode.

I think the following macro will do the job. I’m not certain about үйім. Is it a typo?
Code:
# gather all words
$numFound = Find All '\b(?<c>[[-\p{Cyrillic}]&&[^ГгКкЕе]])*[ӘәӨөҮүІі]\g<c>*\b', 'E'
if ! $numFound
   Exit 'Nothing found, exit...'
end
$doc = Document.active
$words = $doc.selectedSubtexts

# sort the words and join them with \n as separator
$words.sort 'li'  # Intl pref sort order and case insensitive
$words = $words.join "\n"
$words &= "\n"

# paste the word list on a new document
New
Type Text $words
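
As a rough cross-check of what this macro does (match both cases, collect the words, sort, join with newlines), here is a Python sketch. Several things in it are assumptions: Python's `re` has no `\p{Cyrillic}`, so the Unicode block U+0400 to U+04FF is used instead; the set intersection is emulated with a lookahead; the sort is by case-folded code point rather than the international preference order Nisus uses; and `soft_word_list` is a hypothetical name.

```python
import re

# Assumption: the Cyrillic block U+0400..U+04FF stands in for \p{Cyrillic}.
CYRILLIC = "\u0400-\u04FF"

# A Cyrillic letter or hyphen, excluding Г/К/Е in either case.
ALLOWED = rf"(?:(?![ГгКкЕе])[{CYRILLIC}-])"

# Both cases of the four soft vowels are listed explicitly.
SOFT_WORD = re.compile(rf"\b{ALLOWED}*[ӘәӨөҮүІі]{ALLOWED}*\b")


def soft_word_list(text):
    """Return the matching words, one per line, sorted ignoring case."""
    words = SOFT_WORD.findall(text)
    if not words:
        return ""  # nothing found
    return "\n".join(sorted(words, key=str.casefold)) + "\n"


if __name__ == "__main__":
    print(soft_word_list("Әрқайсысы үй КҮЙ ӘН бірге"), end="")
```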


2008-12-28 08:01:56

Joined: 2008-01-16 11:09:49
Posts: 13
Location: Central Asia
Well, it seems you may have a point. On one test file your solution found one word Martin's didn't, "Әрқайсысы". It seemed his worked well because on another file it did find all the pertinent words (I proofread it to make sure).

So, another big thank you to you too, much appreciated.

Here is another, tougher one which I haven't been able to crack: going the other way!

To recap, I simply paste "\b(?<c>[[-\p{Cyrillic}]&&[^ГгКкЕе]])*[ӘәӨөҮүІі]\g<c>*\b" into the PowerFind box and put "\0ٴ" in the Replace box (this places a hamza in front of every instance of those words).

Then I run a straightforward letter->letter Find and Replace, Аа->ا, Әә->ا, Бб->ب, Гг->گ, Ғғ->ع, Оо->و, Өө->و, Сс->س, etc. (You may notice different Cyrillic vowels use the same Arabic vowel. The hamza, or a G, K, or E (г, к, е), tells the reader whether the vowel is "hard" or "soft".) Once that is finished, voila, I have a file transliterated from Cyrillic to Koneshe (the word they use, not "Arabic").
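
The two passes described here (mark the words that need a hamza, then map letter to letter) can be sketched in Python as follows. This is only an illustration under stated assumptions: the letter table is a small partial sample (a few of the pairs named above plus н, taken from the ән example later in the thread), unmapped letters are left unchanged, the hamza is placed before the word in logical order, and `to_arabic` is a hypothetical name.

```python
import re

HAMZA = "\u0674"

# Words with a soft vowel and no Г/К/Е (either case); these get a hamza.
ALLOWED = r"(?:(?![ГгКкЕе])[\w-])"
NEEDS_HAMZA = re.compile(rf"\b{ALLOWED}*[ӘәӨөҮүІі]{ALLOWED}*\b")

# Partial sample table (assumption); a real table would cover the whole alphabet.
LETTERS = {
    "а": "\u0627", "ә": "\u0627",  # both map to alef
    "о": "\u0648", "ө": "\u0648",  # both map to waw
    "б": "\u0628", "с": "\u0633", "н": "\u0646",
}


def to_arabic(text):
    """Pass 1: prefix a hamza where needed. Pass 2: letter->letter mapping."""
    marked = NEEDS_HAMZA.sub(lambda m: HAMZA + m.group(0), text)
    return "".join(LETTERS.get(ch.lower(), ch) for ch in marked)


if __name__ == "__main__":
    for word in ("ән", "ан", "бас"):
        print(word, "->", to_arabic(word))
```

The order matters: the hamza pass has to run while the text is still Cyrillic, because after the mapping the soft and hard vowels can no longer be told apart.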

The tougher problem is going the other way. A straight letter->letter conversion doesn't work for the vowels, since each Koneshe vowel stands for two Cyrillic vowels. How can I tell Nisus that if there is a word with an ا, و, ۇ, ى WITH a hamza, or گ, ك, ە, then it should replace
ا = ә
و = ө
ۇ = ү
ى = і

Once that is finished then I could run the rest of the letter->letter conversions, and the left over vowels would be as so:
ا = а
و = о
ۇ = ұ
ى = ы

I could deal with capitalization later.

I tried this "\b(?<c>[[-\p{Arabic}]&&[ٴگكە]])*[اوۇى]\g<c>*\b" and this "\b(?<c>[[-\p{Arabic}]&&[ٴORگORكORە]])*[اORوORۇORى]\g<c>*\b"but since the letters join it looks for that combo and it finds one word, "ەكى". I tried this "\b(?<c>[[-\p{Arabic}]&&[ا][و][ۇ][ى]])*[ٴ][گ][ك][ە]]\g<c>*\b" but it goes out of order and doesn't compute.

The Unicode codes are as follows, but I am not sure how to implement them in a Find and Replace query:
0627 - ا
0648 - و
06C7 - ۇ
0649 - ى
06AF - گ
0643 - ك
06C1 - ە
0674 - ٴ
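
One way to use such code points in a search is to write them as escapes inside a character set, which is what the "\x{674}" notation in the replies below does. As a hedged illustration in Python (not Nisus), the sketch below builds a set from the numbers in this list; the dictionary keys are just labels chosen for the example and `char_class` is a hypothetical helper.

```python
import re

# Code points as listed in the post; the key names are labels for this sketch only.
CODE_POINTS = {
    "a": 0x0627, "o": 0x0648, "u": 0x06C7, "i": 0x0649,
    "g": 0x06AF, "k": 0x0643, "e": 0x06C1, "hamza": 0x0674,
}


def char_class(*names):
    """Build a regex character class from the labelled code points."""
    return "[" + "".join(f"\\u{CODE_POINTS[n]:04x}" for n in names) + "]"


if __name__ == "__main__":
    soft_markers = char_class("hamza", "g", "k", "e")
    print(soft_markers)
    print(bool(re.search(soft_markers, "\u0627\u0643")))
```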

Well, there's my ramblings. Thanks again!



2008-12-29 00:26:46

Joined: 2008-05-17 04:02:32
Posts: 400
Hello again,

I may have misunderstood you, but you can find/select “Ә, Ө, Ү, І, ә, ө, ү or і” preceded by “a hamza (U+0674), Г, К, Е, г, к or е” by this command:
Code:
Find All '(?<=[\x{674}ГКЕгке])[ӘӨҮІәөүі]', 'E-i'
Then, you can use the following macro to arabicize the Cyrillic vowels in the selections (s option) while preserving the selections (S option, which has no corresponding GUI element):
Code:
Replace All 'Ә|ә', 'ا', 'ESs-i'  # | stands for OR
Replace All 'Ө|ө', 'و', 'ESs-i'
Replace All 'Ү|ү', 'ۇ', 'ESs-i'
Replace All 'І|і', 'ى', 'ESs-i'

Does this solve the problem?


2008-12-29 05:52:46

Joined: 2008-01-16 11:09:49
Posts: 13
Location: Central Asia
I'm actually trying to go from an arabicized file (something that was put through the wonderful previous solutions) BACK to Cyrillic. I have several files to go into Kazak Arabic, and the above solutions work great; BUT I also have several to go back into Cyrillic. Somehow going backwards through this process isn't as easy.

Sorry for not being clearer.

The situation would be to Find all Kazak Arabic words with either A) a hamza on the front, and then somehow select those words to Replace their vowels with the correct "soft" vowels in Cyrillic; or B) a G, K, or e, because the presence of those letters also indicates the vowels are soft, and then Replace the vowels with the correct "soft" vowels in Cyrillic.

After I convert the soft vowels, I could then run a simple letter->letter Find and Replace. It isn't as easy as the Replace going Cyrillic -> Arabic, because hard and soft Cyrillic vowels use the same Arabic character (АаӘә=ا, but ا = either Аа or Әә, depending on the presence of a hamza, or the g, k, or e; ٴان=ән, ان=ан). Going Arabic -> Cyrillic means I need to find a way to identify the words with soft vowels so I can Replace them first, and then do the hard vowels.



2008-12-29 08:47:45

Joined: 2008-05-17 04:02:32
Posts: 400
So is this what you are looking for?

1. Select words beginning with a hamza (\x{674}) or containing G, K, or e (\x{6AF}, \x{643}, \x{6C1}).
Code:
Find All '\b(?:\x{674}\w+|\w*[\x{6AF}\x{643}\x{6C1}]\w*)\b', 'E'
# \b: word boundary
# (?:   ): non-capturing group
# \w: any word character
# +: 1 or more times
# *: 0 or more times
# |: OR operator
# [ABC]: A or B or C

2. Replace their vowels to the "soft" vowels in Cyrillic:
Code:
Replace All 'ا', 'ә', 'ESs-i'
Replace All 'و', 'ө', 'ESs-i'
Replace All 'ۇ', 'ү', 'ESs-i'
Replace All 'ى', 'і', 'ESs-i'
# E: PowerFind Pro
# S: Preserve Selection
# s: In Selection
# -i: Case Insensitive turned off (not necessary here)
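
To see the whole Arabic -> Cyrillic vowel step in one place, here is a Python sketch under stated assumptions. It uses the code points listed earlier in the thread (hamza U+0674, G U+06AF, K U+0643, e U+06C1), decides per word whether the vowels are soft, and then picks the soft or hard Cyrillic vowel. Dropping the leading hamza is an assumption of this sketch, consonants are deliberately left for the later letter->letter pass, and `vowels_to_cyrillic` is a hypothetical name.

```python
import re

HAMZA = "\u0674"
SOFT_MARKERS = "\u06AF\u0643\u06C1"  # G, K, e as listed in the thread

# Vowel tables from the thread: soft readings and hard readings.
SOFT = {"\u0627": "ә", "\u0648": "ө", "\u06C7": "ү", "\u0649": "і"}
HARD = {"\u0627": "а", "\u0648": "о", "\u06C7": "ұ", "\u0649": "ы"}

WORD = re.compile(r"\w+")


def vowels_to_cyrillic(text):
    """Replace Arabic-script vowels with soft or hard Cyrillic vowels, word by word."""
    def convert(match):
        word = match.group(0)
        # A word is soft if it starts with a hamza or contains G, K or e.
        soft = word.startswith(HAMZA) or any(ch in SOFT_MARKERS for ch in word)
        table = SOFT if soft else HARD
        word = word.lstrip(HAMZA)  # assumption: the hamza itself is not carried over
        return "".join(table.get(ch, ch) for ch in word)
    return WORD.sub(convert, text)


if __name__ == "__main__":
    print(vowels_to_cyrillic("\u0674\u0627\u0646 \u0627\u0646"))
```

Doing the soft/hard decision per word avoids having to run the soft replacements strictly before the hard ones, though the two-pass order used in the macros above gives the same result for the vowels.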


2008-12-29 17:49:33

Joined: 2008-01-16 11:09:49
Posts: 13
Location: Central Asia
Kino, Martin, thank you. I have been trying for some time to come up with solutions to these problems. I had one back in OS 9 days using some grep-based app, but since the move to OS X and Unicode I haven't had much success, and have basically been going through word by word, letter by letter, and changing by hand.

These two solutions work wonders. Save time. And I am learning by them too with your explanations of the syntax.

I hope these can be of use to others with similar needs.



2008-12-29 19:38:29