RegEx for a string between two symbols

I’m looking to build a dictionary application. One part I am having
difficulty with is setting up the RegEx for the hash.

I would like to structure it something like this

:word-definition;
:word2-definition2;

etc.

Can someone please help me with a regex whose matches I can then push
into a hash?

Thanks!

Hi,

I don’t really get what you’re trying to do. Do you want to store the
dictionary data as structured text and then parse it and put it into a
Ruby hash? And do you want to use Ruby’s syntax to make up your own data
format?

If so: Why not use a common data format like XML or JSON? Those are
mature and already have parsers. And if you’re going to have a lot of
data, you’re better off using a database, anyway.
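
To make the JSON suggestion concrete, here is a minimal sketch using the json library that ships with Ruby. The JSON text and the word/definition entries are made up for illustration; in a real program the text would come from a file.

```ruby
require 'json'

# Hypothetical contents of a dictionary file in JSON form.
json_text = <<~JSON
  {
    "word1": "definition1",
    "word2": "definition2"
  }
JSON

# JSON.parse returns a Hash with String keys by default.
dictionary = JSON.parse(json_text)

puts dictionary['word1']
```

The result is an ordinary Ruby hash, so no custom parsing code is needed.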

Hey, thanks for the reply. I’ve been practising my Ruby, but if you have
a JSON or XML solution I will look at / learn those too, as Ruby is often
used in tandem with XML and JSON.

So I’m looking to start a text file where I can type words and their
corresponding definitions on the same line. For example:

In dictionary.txt I would have
word1 - definition1;
word2 - definition2;

Then in my ruby program I will have a hash which will have the key as
the word and the value as the definition.

From there I was thinking of adding additional functionality such as a
random definition selector and then I could type the applicable word. I
was also thinking that I could provide 4 words and a definition and see
if I knew which one matched.
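
The quiz idea could be sketched roughly like this once the hash exists. The sample entries and the helper names random_entry and build_question are hypothetical, chosen only for this illustration.

```ruby
# Sample data standing in for the parsed dictionary file.
dictionary = {
  'apple'  => 'a round fruit',
  'banana' => 'a long yellow fruit',
  'cherry' => 'a small red fruit',
  'grape'  => 'a small fruit that grows in bunches',
  'lemon'  => 'a sour yellow fruit'
}

# Pick one random entry; Array#sample returns a [word, definition] pair.
def random_entry(dictionary)
  dictionary.to_a.sample
end

# Build a multiple-choice question: one definition plus `count` candidate
# words, exactly one of which is the correct answer.
def build_question(dictionary, count = 4)
  answer, definition = random_entry(dictionary)
  wrong = (dictionary.keys - [answer]).sample(count - 1)
  { definition: definition, answer: answer, choices: (wrong + [answer]).shuffle }
end

question = build_question(dictionary)
puts "Which word means: #{question[:definition]}?"
question[:choices].each_with_index { |word, i| puts "#{i + 1}. #{word}" }
```

This assumes the dictionary holds at least four words; with fewer, the list of choices simply comes out shorter.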

If it’s only one word and one definition per line, then XML or JSON is
probably too much. But I wouldn’t use a hyphen as a separator, because
it may also appear in words or definitions. Use something special like a
double colon:

word :: definition

You don’t have to mark the end of the definition, because you can simply
read the whole line.

The “parser” could be something like this:

dictionary = {}
File.foreach 'C:/dictionary.txt' do |line|
  next if line.strip.empty? # skip empty lines
  parts = line.split('::').map &:strip # split line at the double colon
  if parts.length == 2 and parts.none? &:empty?
    word, definition = parts
    raise "multiple definitions for #{word}" if
      dictionary.has_key? word.to_sym
    dictionary[word.to_sym] = definition
  else
    raise "invalid line: #{line}"
  end
end

p dictionary

Robert K. wrote in post #1054952:

dictionary = {}
File.foreach 'C:/dictionary.txt' do |line|
  next if line.strip.empty? # skip empty lines
  parts = line.split('::').map &:strip # split line at the double colon

That’s unsafe because it will break if there is a ‘::’ in the
definition.

It’s meant to break, because multiple double colons make the line
ambiguous (unless you make certain assumptions about the words and
definitions).

If you actually want to allow literal double colons in a safe way, you’d
have to implement some kind of escape syntax. But I don’t think that’s
necessary.
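
To illustrate the difference under discussion, here is a small sketch with a made-up line containing two double colons. The second call uses the limit argument of String#split, which is an alternative not used in the parser above and is shown only for comparison.

```ruby
line = 'scope :: the :: operator in Ruby'

# Plain split yields three parts, so a "parts.length == 2" check rejects the line.
strict = line.split('::').map(&:strip)

# With a limit of 2, only the first "::" is treated as the separator.
lenient = line.split('::', 2).map(&:strip)

p strict   # ["scope", "the", "operator in Ruby"]
p lenient  # ["scope", "the :: operator in Ruby"]
```

Which behaviour is preferable depends on whether a second double colon should count as an input error or as part of the definition.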

if %r{\A\s*(\w+)\s*::\s*+(.*?)\s*\z} =~ line
  yield $1, $2 # word, def

I find the \w+ pattern much too strict for words in a dictionary, since
they may include spaces, hyphens or even special characters like
umlauts.

Btw, I wouldn’t use symbols for the words because there are potentially
many and they are not GC’ed.

If the word list actually gets that big, you’re better off using a
database, anyway.

Jan, Robert,

Thanks so much. I really appreciate it.

Chris

Jan E. wrote in post #1054940:

dictionary = {}
File.foreach 'C:/dictionary.txt' do |line|
  next if line.strip.empty? # skip empty lines
  parts = line.split('::').map &:strip # split line at the double colon

That’s unsafe because it will break if there is a ‘::’ in the
definition.

  if parts.length == 2 and parts.none? &:empty?
    word, definition = parts
    raise "multiple definitions for #{word}" if
      dictionary.has_key? word.to_sym
    dictionary[word.to_sym] = definition
  else
    raise "invalid line: #{line}"
  end
end

p dictionary

I would separate parsing and dictionary building. For example

def read_dict(file_name)
  File.foreach file_name do |line|
    line.chomp!

    if %r{\A\s*(\w+)\s*::\s*+(.*?)\s*\z} =~ line
      yield $1, $2 # word, def
    else
      raise "invalid line: #{line}"
    end
  end
end

def load_dict(file_name)
  dict = {}

  read_dict file_name do |word, definition|
    raise "multiple definitions for #{word}" if
      dict.has_key? word

    dict[word] = definition
  end

  dict
end
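
A quick way to try the two methods without touching a real dictionary file is a temporary file. The sketch below repeats the definitions so it runs on its own; the two sample entries are invented for the example.

```ruby
require 'tempfile'

# Yields each (word, definition) pair found in the file.
def read_dict(file_name)
  File.foreach(file_name) do |line|
    line.chomp!

    if %r{\A\s*(\w+)\s*::\s*+(.*?)\s*\z} =~ line
      yield $1, $2 # word, definition
    else
      raise "invalid line: #{line}"
    end
  end
end

# Collects the pairs into a hash and rejects duplicate words.
def load_dict(file_name)
  dict = {}

  read_dict(file_name) do |word, definition|
    raise "multiple definitions for #{word}" if dict.key?(word)

    dict[word] = definition
  end

  dict
end

# A throwaway file standing in for dictionary.txt.
dict = nil
Tempfile.create('dictionary') do |file|
  file.write("ruby :: a programming language\nhash :: a key-value collection\n")
  file.flush
  dict = load_dict(file.path)
end

p dict
```

Because read_dict only yields pairs, the same reader could feed something other than a hash later, for example a database insert.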

Btw, I wouldn’t use symbols for the words because there are potentially
many and they are not GC’ed.

Kind regards

robert

Jan E. wrote in post #1054964:

Robert K. wrote in post #1054952:

dictionary = {}
File.foreach 'C:/dictionary.txt' do |line|
  next if line.strip.empty? # skip empty lines
  parts = line.split('::').map &:strip # split line at the double colon

That’s unsafe because it will break if there is a ‘::’ in the
definition.

It’s meant to break, because multiple double colons make the line
ambiguous (unless you make certain assumptions about the words and
definitions).

Since we are talking about “words” I deem it reasonable to assume they
do not contain “::”.

If you actually want to allow literal double colons in a safe way, you’d
have to implement some kind of escape syntax. But I don’t think that’s
necessary.

Exactly, because words do not contain “::”. :-) I think this is a
pretty minor restriction, and we can then use a slightly modified regexp by
substituting “\w+” with “\S.*?”:

%r{\A\s*(\S.*?)\s*::\s*+(.*?)\s*\z}

Now there needs to be at least one “::” in the line and the first “::”
on the line is the delimiter. No escaping needed. And the regexp does
the stripping already.
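
Here is a small sketch of how that pattern behaves on a few made-up lines. The constant name LINE_PATTERN and the sample strings are only for illustration.

```ruby
LINE_PATTERN = %r{\A\s*(\S.*?)\s*::\s*+(.*?)\s*\z}

samples = [
  'apple :: a fruit',
  '  ice cream ::  a frozen dessert  ',
  'scope :: the :: operator',
  'no separator here'
]

# Each result is a [word, definition] pair, or nil when the line does not match.
results = samples.map do |line|
  match = LINE_PATTERN.match(line)
  match ? [match[1], match[2]] : nil
end

results.each { |pair| p pair }
# ["apple", "a fruit"]
# ["ice cream", "a frozen dessert"]
# ["scope", "the :: operator"]
# nil
```

Words with spaces are accepted, surrounding whitespace is stripped, and a second “::” simply ends up in the definition.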

if %r{\A\s*(\w+)\s*::\s*+(.*?)\s*\z} =~ line
  yield $1, $2 # word, def

I find the \w+ pattern much too strict for words in a dictionary, since
they may include spaces, hyphens or even special characters like
umlauts.

Let’s ask the OP: Christopher, what is a “word” in your application? Do
word characters (\w) suffice, or are you using a broader definition of
“word”?

Btw, I wouldn’t use symbols for the words because there are potentially
many and they are not GC’ed.

If the word list actually gets that big, you’re better off using a
database, anyway.

Maybe. But since Christopher appears to be just starting to learn Ruby, I
think he should start using Symbol and String properly right away, which
will avoid hassle later on. Since symbols are never GC’ed, it is most of
the time a bad idea IMHO to create them dynamically, i.e. from input.
Symbols are best used for a limited number of identifiers used
throughout the program (e.g. names of states). Everything that comes
from user input is usually better off being placed in Strings.
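
As a rough sketch of that division of labour: the state names and sample lines below are invented for the example, and the split with a limit of 2 is just a compact stand-in for the parsing discussed above.

```ruby
# A small, fixed set of identifiers known when the program is written:
# a natural fit for symbols.
STATES = %i[waiting asking finished].freeze

# Data that originates outside the program (file contents, user input):
# kept as strings rather than converted with to_sym.
dictionary = {}
['apple :: a fruit', 'ice cream :: a frozen dessert'].each do |line|
  word, definition = line.split('::', 2).map(&:strip)
  dictionary[word] = definition
end

state = :asking
puts "state: #{state}" if STATES.include?(state)
p dictionary.keys
```

Lookups then use the string the user typed directly, e.g. dictionary['apple'], with no conversion step.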

Kind regards

robert