Finding Shortest +1 doesn’t work as I think it should

Matze · Post by **Matze** » 2014-02-13 09:08:18

Hi,

I have got a text like this:
blabla blablaTABnumberTABblabla blabla bla?TABbla blaTABbla blaRETURN

I want to replace the last TAB, the blabla after it, and the RETURN with a single RETURN.

So I search for \t.+?\n
But this finds TABnumberTABblabla blabla bla?TABbla blaTABbla blaRETURN (without the first blabla blabla)

Why? The shortest text between a tab and the return is only the last blabla!
So what do I have to write for „text“? The text contains letters, numbers, punctuation and URL-relevant letters like . , ! „“ ? / : %

Please help.

Groucho · Post by **Groucho** » 2014-02-13 10:03:42

Hello, Matze.

PowerFind Pro is OK. Your regexp finds a string starting with a tab (\t), followed by a string which does not contain a return (.+?), followed by a return (\n). By the way, you can do without the non-greedy question mark, as the period (.) means any character except return.
So, the correct regexp should be:

Code: Select all

.+\t.+\n

In other words: a string of characters (.+) followed by a tab (\t) followed by another string (.+) followed by return (\n).

Best, Henry.

phspaelti · Post by **phspaelti** » 2014-02-13 11:40:38

Matze wrote:So I search for \t.+?\n
But this finds TABnumberTABblabla blabla bla?TABbla blaTABbla blaRETURN (without the first blabla blabla)

Why? The shortest text between a tab and the return is only the last blabla!
So what do I have to write for „text“? The text contains letters, numbers, punctuation and URL-relevant letters like . , ! „“ ? / : %

The problem is that the "shortest" only becomes relevant after it has found the tab. Since it finds the first tab the remaining stuff is the shortest amount of text after that tab.
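To see this leftmost-first behaviour outside Nisus, here is a minimal sketch using Python's `re` module, on the assumption that it treats the lazy quantifier the same way for this pattern. The sample line is made up, with 42 standing in for the number.

```python
import re

# Hypothetical sample line modelled on the one in the question.
line = "blabla blabla\t42\tblabla blabla bla?\tbla bla\tbla bla\n"

# The engine tries start positions from left to right, so the match begins
# at the FIRST tab; ".+?" is only lazy about where the match ends.
lazy = re.search(r"\t.+?\n", line)

first_tab = line.index("\t")
last_tab = line.rindex("\t")

print(lazy.start() == first_tab)  # the match starts at the first tab
print(lazy.start() == last_tab)   # ... and not at the last one
print(repr(lazy.group()))
```

The lazy quantifier never moves the starting point; it only stops the match at the first return it can reach from there.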

In my opinion the best option in cases like this is to use the 'not'-set. So you can do the following:

Code: Select all

\t[^\t]*\n

This approach has the advantage that you can use it to select any tab. So the following will pick the last three tabs (and any "blahblah" in-between).

Code: Select all

(?:\t[^\t]*){3}\n
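As a rough check of the two 'not'-set patterns, the following sketch runs them through Python's `re` module, which is only assumed to behave like the Nisus engine here. The group is written as a non-capturing `(?:...)`, the sample text is invented, and the stricter `[^\t\n]` variant at the end is my own suggestion, not part of the original advice.

```python
import re

# Hypothetical sample line modelled on the one in the question.
line = "blabla blabla\t42\tblabla blabla bla?\tbla bla\tbla bla\n"

# Last tab, the text after it, and the return.
last_field = re.search(r"\t[^\t]*\n", line)

# Replace that span with a single return, as asked in the first post.
replaced = re.sub(r"\t[^\t]*\n", "\n", line)

# Last three tabs and everything in between.
last_three = re.search(r"(?:\t[^\t]*){3}\n", line)

# Caveat: [^\t] also matches a return, so the match can run on through
# following lines that contain no tab. Excluding \n as well avoids that.
text = "a\tb\nno tabs here\nc\td\n"
loose = re.findall(r"\t[^\t]*\n", text)
strict = re.findall(r"\t[^\t\n]*\n", text)

print(repr(last_field.group()))
print(repr(replaced))
print(repr(last_three.group()))
print(loose)
print(strict)
```

If the document has paragraphs without any tabs between the tabbed lines, the stricter form is probably the safer one to save.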

Matze · Post by **Matze** » 2014-02-14 01:11:41

Groucho wrote:Hello, Matze.

PowerFind Pro is OK. Your regexp finds a string starting with a tab (\t), followed by a string which does not contain a return (.+?), followed by a return (\n). By the way, you can do without the non-greedy question mark, as the period (.) means any character except return.
So, the correct regexp should be:
Code: Select all
.+\t.+\n
In other words: a string of characters (.+) followed by a tab (\t) followed by another string (.+) followed by return (\n).

Best, Henry.

This string selects/finds a whole paragraph, Henry.
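This can be reproduced in a small sketch, assuming Python's `re` module as a stand-in for the Nisus engine and a made-up sample line:

```python
import re

# Hypothetical sample line modelled on the one in the question.
line = "blabla blabla\t42\tblabla blabla bla?\tbla bla\tbla bla\n"

# The leading ".+" is greedy and starts at the beginning of the line,
# so everything in front of the last tab becomes part of the match.
greedy = re.search(r".+\t.+\n", line)

print(greedy.group() == line)  # the whole paragraph is selected
```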

Matze · Post by **Matze** » 2014-02-14 01:45:03

phspaelti wrote:
Matze wrote:So I search for \t.+?\n
But this finds TABnumberTABblabla blabla bla?TABbla blaTABbla blaRETURN (without the first blabla blabla)

Why? The shortest text between a tab and the return is only the last blabla!
So what do I have to write for „text“? The text contains letters, numbers, punctuation and URL-relevant letters like . , ! „“ ? / : %
The problem is that the "shortest" only becomes relevant after it has found the tab. Since it finds the first tab the remaining stuff is the shortest amount of text after that tab.

In my opinion the best option in cases like this is to use the 'not'-set. So you can do the following:
Code: Select all
\t[^\t]*\n
This approach has the advantage that you can use it to select any tab. So the following will pick the last three tabs (and any "blahblah" in-between).
Code: Select all
(?:\t[^\t]*){3}\n

Cool! Thanks a lot, Philip. I seem to understand the meaning of "." as it was in NWC.
By the way, dear Nisus team: may I suggest that you rethink the expressions in the drop-down menu of the search dialog?
They should be expressions a typical writer would use.
- I can search for upper-case or lower-case characters, but why can't I search for upper AND lower? I know I can, but I have to know the expression for it.
Why not offer [[:alpha:]] as an option as well?
- I’d like an expression for the old NWC ".", which found letters, numbers, punctuation marks and spaces, i.e. all the characters that usually occur in a sentence, and only those.
- Regarding "word": the current "word" finds single numbers, too. A number is not a word, is it?
- Why not add a "number" expression to the drop-down menu, below Ziffer/digit, which finds every number: 1 13 0,234 1 Mio 2 Billion -12 3/4 square expressions Pi and what not.
- Regarding "sentences": the current one finds "sentences" that contain tabs. It even finds parts of a URL as a sentence. Couldn't we have a sentence expression that finds everything between two sentence endings but contains no tabs or returns? So just a sentence, whether descriptive or dialogue.
- Regarding "paragraphs": a paragraph contains all its sentences AND the return that separates it from the text below. The current paragraph expression (?:^.+$) finds it without the return. When I copy such a paragraph and paste it into another paragraph, it becomes part of that paragraph, and I have to hit return to make it a paragraph of its own again.

Edit: One more major request: In NWC there was a button in find/replace that created a context list of everything that was found. Can we have that back, please?
Best regards, Matze

phspaelti · Post by **phspaelti** » 2014-02-14 07:09:46

Matze wrote:By the way, dear Nisus team: may I suggest that you rethink the expressions in the drop-down menu of the search dialog?
They should be expressions a typical writer would use.
- I can search for upper-case or lower-case characters, but why can't I search for upper AND lower? I know I can, but I have to know the expression for it.
Why not offer [[:alpha:]] as an option as well?

I totally agree with this one. This should be added to the wildcard menu

Matze wrote:- the old NWC ".", which found letters, numbers, punctuation marks and spaces, i.e. all the characters that usually occur in a sentence.

This one is there. It's called "AnyTextCharacter".

Matze wrote:- Regarding "word": the current "word" finds single numbers, too. A number is not a word, is it?

This is a well-known technical issue. "Word" here is used in the computer technical sense, which is a string of "word characters". From the computer technician's point of view word characters stand in opposition to "white-space", punctuation, and new-line/line-feed, so yes they include numbers. Nisus really can't do anything about this, since they don't really write the find/replace engine themselves (I believe). And it would break more things than it would fix.

But for yourself perhaps [[:alpha:]]+ would work as a definition of word? The real problem however is that 'word', as you think of it, is a linguistic notion, which really can't be defined in a generally satisfying way. Is "there's" one word or two? If two, is the second [s] or ['s]? In a list, A. B. C., etc. probably shouldn't be words but at the beginning of a sentence "A" is of course a word, and on and on.
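For comparison, here is a minimal sketch of the difference between "word characters" and letters only. It uses Python's `re` module, which does not support the `[[:alpha:]]` notation, so `[^\W\d_]` (a word character that is neither a digit nor an underscore) is used as a rough stand-in; that substitution and the sample sentence are assumptions of this sketch.

```python
import re

sample = "There's 1 apple in list A. and 23 more"

# "Word" in the technical sense: runs of word characters, digits included.
technical_words = re.findall(r"\w+", sample)

# Letters only, roughly what [[:alpha:]]+ selects.
letter_words = re.findall(r"[^\W\d_]+", sample)

print(technical_words)
print(letter_words)
```

Note that the letters-only version still picks up the list label "A" and splits "There's" in two, which illustrates why no single definition is fully satisfying.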

Matze wrote:- Why not add a "number" expression to the drop-down menu, below Ziffer/digit, which finds every number: 1 13 0,234 1 Mio 2 Billion -12 3/4 square expressions Pi and what not.

Sounds like a good idea. Nisus should probably consider editing and expanding the list of predefined wild-cards. But remember that you can also save your own expressions, and you can even give them names. They will then be listed under "Saved expressions". (I keep forgetting this feature myself.)

Matze wrote:- Regarding "sentences": the current one finds "sentences" that contain tabs. It even finds parts of a URL as a sentence. Couldn't we have a sentence expression that finds everything between two sentence endings but contains no tabs or returns? So just a sentence, whether descriptive or dialogue.

Sentences are a well known nightmare. Again they are a linguistic notion, so there is no satisfactory solution. But the idea of excluding tabs strikes me as a good one. Here is today's attempt at a better definition of sentence:

Code: Select all

(?:["“„'‘‚]?[[:upper:]][^\t]+?[\.\?\!]["”‟'’‛]?)(?= |$)

But of course it will still catch things that aren't sentences.
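A quick way to try this expression is the following sketch in Python's `re` module. Python has no `[[:upper:]]`, so `[A-Z]` (ASCII capitals only) is swapped in, and `re.MULTILINE` is assumed to be needed so that `$` matches at the end of each line; both are assumptions of this sketch, as are the sample strings.

```python
import re

OPENING = "\"“„'‘‚"   # optional opening quotation marks
CLOSING = "\"”‟'’‛"   # optional closing quotation marks

SENTENCE = re.compile(
    rf"(?:[{OPENING}]?[A-Z][^\t]+?[.?!][{CLOSING}]?)(?= |$)",
    re.MULTILINE,
)

prose = "It rained. We stayed in.\nSee www.example.com for more. Done!"
table_row = "Name\tIt works. Yes!"

# The dots inside the URL are skipped because no space or line end follows.
print(SENTENCE.findall(prose))

# "Name" is followed by a tab before any sentence ending, so it is not matched.
print(SENTENCE.findall(table_row))
```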

Matze wrote:- Regarding "paragraphs": a paragraph contains all its sentences AND the return that separates it from the text below. The current paragraph expression (?:^.+$) finds it without the return. When I copy such a paragraph and paste it into another paragraph, it becomes part of that paragraph, and I have to hit return to make it a paragraph of its own again.

This opens another can of worms. Maybe they did it this way because of the change to non-contiguous copy (which now adds newlines)? Again define your own (?:^.+\n) and save it as an expression. But note that this one will not catch the last paragraph in a document (unless it has an actual newline).
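The difference between the two paragraph expressions, including the missed last paragraph, can be sketched like this in Python's `re` module (assumed to behave like the Nisus engine when `re.MULTILINE` is set). The third pattern, with an optional return, is my own variation and not something from the wildcard menu.

```python
import re

# Two paragraphs; the last one has no trailing return.
doc = "First paragraph.\nSecond paragraph."

without_return = re.findall(r"(?:^.+$)", doc, re.MULTILINE)
with_return = re.findall(r"(?:^.+\n)", doc, re.MULTILINE)

# Hypothetical variation: take the return when there is one.
either = re.findall(r"^.+\n?", doc, re.MULTILINE)

print(without_return)
print(with_return)
print(either)
```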

Matze wrote:One more major request: In NWC there was a button in find/replace that created a context list of everything that was found. Can we have that back, please?

Didn't we just go over this one recently? (Did you ever try my macro?)

Matze · Post by **Matze** » 2014-02-14 07:35:44

Dear Philip,

I am just leaving the office. So only this for now: thanks for your thoughts, I will answer them tomorrow.

And: yes, I am using your list macro intensively! Thanks again. But I’d like to have the option of opening a list after I have already found numerous hits. If there are, for example, 23 hits and I thought there should have been only two, I would like to see them all in a list just by clicking a button or something.

Until tomorrow! Matze

Matze · Post by **Matze** » 2014-02-16 01:15:29

Matze wrote:- the old NWC ".", which found letters, numbers, punctuation marks and spaces, i.e. all the characters that usually occur in a sentence.

phspaelti wrote:This one is there. It's called "AnyTextCharacter".

Did "." in NWC find a tab as well? I thought it didn't

Matze wrote:- Regarding "word": the current "word" finds single numbers, too. A number is not a word, is it?

phspaelti wrote:This is a well-known technical issue. "Word" here is used in the computer technical sense, which is a string of "word characters". From the computer technician's point of view word characters stand in opposition to "white-space", punctuation, and new-line/line-feed, so yes they include numbers. Nisus really can't do anything about this, since they don't really write the find/replace engine themselves (I believe). And it would break more things than it would fix.

But for yourself perhaps [[:alpha:]]+ would work as a definition of word?

Definitely! I will save that as my personal "word" search term. But maybe it is a good idea to add it to the wildcard menu, too. I guess most people just write text with a word processor, and with that in mind NWP's find/replace should be very easy to understand yet very powerful to use. So if the wildcard menu contains "word" and it doesn't mean what most people, like me, would presume, then the real NWP meaning of "word" and of all the other terms should be explained, ideally right in find/replace. (In Nisus' help file the meaning of Any Word is "any word" ...)
Moreover, an AnyRealWord or something like it would help:

phspaelti wrote:The real problem however is that 'word', as you think of it, is a linguistic notion, which really can't be defined in a generally satisfying way. Is "there's" one word or two? If two, is the second [s] or ['s]? In a list, A. B. C., etc. probably shouldn't be words but at the beginning of a sentence "A" is of course a word, and on and on.

NWP's current Any Word finds "there" and "s", and that is fine with me. But it shouldn't find "1" and "A.".

Matze wrote:- Why not add a "number" expression to the drop-down menu, below Ziffer/digit, which finds every number: 1 13 0,234 1 Mio 2 Billion -12 3/4 square expressions Pi and what not.

phspaelti wrote:Sounds like a good idea. Nisus should probably consider editing and expanding the list of predefined wild-cards. But remember that you can also save your own expressions, and you can even give them names. They will then be listed under "Saved expressions". (I keep forgetting this feature myself.)

Yes, I do remember that, and I am saving

But for all those who are not using NWP yet, more "linguistic" wildcards would be a good argument for becoming an NWP user.

phspaelti wrote:Sentences are a well known nightmare. Again they are a linguistic notion, so there is no satisfactory solution. But the idea of excluding tabs strikes me as a good one. Here is today's attempt at a better definition of sentence:
Code: Select all
(?:["“„'‘‚]?[[:upper:]][^\t]+?[\.\?\!]["”‟'’‛]?)(?= |$)
But of course it will still catch things that aren't sentences.

Well, if one can't define a sentence, then one shouldn't name a wildcard "sentence". But what should it be called instead?
Kind of tricky. Sigh.
