LinuxDevCenter.com


Building Unix Tools with Ruby

Get the Plumbing Right

With option parsing code in place, you are now ready to add code for processing CSV files and for making your script behave like a proper command line tool.



It is an old Unix tradition that commands can be piped together to create more complex tools. Your script should obey that convention; doing so will make it more flexible and allow other users to do things the authors of the software never dreamed of.

Writing a Ruby script that fits into that scheme is actually very simple. The simplest piece of code that copies everything from STDIN to STDOUT is just three lines long:

while gets
    print 
end

Add it at the end of your script and see how it works. You do not need to worry about the way data is sent to your script. Both examples shown below give the same results, all without writing additional code.

$ cat file1 file2 | csvt -e 2,0
$ csvt -e 2,0 file1 file2
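
Both invocations behave the same because gets, when called without a receiver, reads from ARGF: the files named in ARGV if there are any, and standard input otherwise. This assumes the option-parsing code has already removed the options from ARGV. As a minimal sketch, the same copy loop can be written with the streams passed in explicitly, which makes it easy to try out; the method name copy_lines is hypothetical and not part of csvt:

```ruby
require 'stringio'

# Copy every line from input to output unchanged.
# copy_lines is a hypothetical helper, not part of the csvt script.
def copy_lines(input, output)
  input.each_line { |line| output.print(line) }
end

# In a real script you would call: copy_lines(ARGF, $stdout)
# Here a StringIO stands in for the input so the example needs no files.
source = StringIO.new("a,b,c\n1,2,3\n")
sink   = StringIO.new
copy_lines(source, sink)
```

After the call, sink.string holds an exact copy of the input.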

Processing Input

The simple loop shown in the previous section is not very useful, because it does not process its input, but it does illustrate the general concept. The csvt script will use two such loops, one for --extract and one for --remove. Both start with a test of the appropriate flag, extract_f for --extract and remove_f for --remove.

if extract_f == true
     first_f = true

The first_f flag is used to avoid the "off by one" error inside the while loop:

while gets
        data   = $_.chomp
        data   = data.split(",")
        data_n = data.length

Every loop cycle starts with a call to gets, which reads a new line from STDIN and stores it in $_. Next the script removes the end of line character and splits the line into an array of separate columns.
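
Two details of this step are worth a quick experiment. String#chomp removes only a trailing line terminator, whereas String#chop removes the last character whatever it is, which matters when the final line of a file has no newline. Also, String#split with a single argument silently drops trailing empty fields unless a negative limit is passed. The sketch below illustrates both; parse_line is a hypothetical helper, not part of csvt:

```ruby
# parse_line is a hypothetical helper used only for this illustration.
def parse_line(raw)
  # chomp removes a trailing line terminator only if one is present;
  # the -1 limit keeps trailing empty fields that split(",") would drop.
  raw.chomp.split(",", -1)
end

all_fields   = parse_line("a,b,,\n")        # four fields, the last two empty
short_fields = "a,b,,\n".chomp.split(",")   # trailing empty fields dropped

chopped = "a,b,c".chop    # no newline present, so the "c" is lost
chomped = "a,b,c".chomp   # unchanged
```

Whether empty trailing fields should count as columns is a design decision; the article's code uses the single-argument form.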

        if first_f
            old_data_n = data_n
            first_f    = false
        end

The size of the array is stored in data_n. The script then tests whether the line just read is the first line. If so, it sets the column count of the non-existent previous line to that of the first line, so that the data integrity check (which compares the number of columns in the previous and the current line) passes.

        if data_n != old_data_n
            $stderr.print "csvt: the number of fields on the " +
                          "following line does not match the number " +
                          "of fields on the previous line\n"
            $stderr.print $_
            exit(1)
        end

Should the data integrity test fail, the error message, followed by the offending line, is printed to standard error and csvt exits with code 1. It is tempting to relax the rules a little and introduce an option for skipping such errors, but that is a job for a separate tool: a specialized data integrity checker, which is usually written with a particular data set in mind and is therefore outside the scope of csvt's specification.
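
The check can also be tried in isolation. The following is a minimal sketch, not the article's code: first_inconsistent_line is a hypothetical name, the method returns the offending line instead of exiting, and a nil baseline replaces the first_f flag.

```ruby
require 'stringio'

# Return the first line whose field count differs from the line before it,
# or nil when every line has the same number of fields.
# first_inconsistent_line is a hypothetical name used for illustration.
def first_inconsistent_line(input)
  expected = nil                        # field count of the previous line
  input.each_line do |raw|
    count = raw.chomp.split(",").length
    expected ||= count                  # the first line sets the baseline
    return raw if count != expected
    expected = count
  end
  nil
end

good = StringIO.new("a,b,c\n1,2,3\n")
bad  = StringIO.new("a,b,c\n1,2\n4,5,6\n")
```

Calling the method on good yields nil, and on bad it yields the line "1,2\n", which a caller could then report on $stderr.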

When everything goes well, we can begin constructing a line of output. This starts with initializing the line variable:

line = ""

Next we iterate over the array of arguments for the --extract option. Notice the test that checks whether each column index is less than the number of fields in the line we just read. If it is not, csvt complains, suggests the allowed range of indexes, and exits with code 1.

        extract_args.each do |column|

            if !(column < data_n)
                $stderr.print "csvt: column index out of range, " +
                              "use numbers between 0 and ",
                              data_n - 1, "\n"
                exit(1)
            end

If all goes well, we use the value of column as the index into the data array and add the result to the string stored in line, followed by a comma.

            line += data[column] + ","
        end

Once all columns listed as arguments of --extract have been processed, we can print the contents of the line variable, less the last character, which we replace with the end of line character.

        print line[0, line.length-1], "\n"

The last thing is setting the old_data_n variable to the number of columns in the currently processed line, so the data integrity check can spot any errors.

        old_data_n = data_n
    end
end

So it goes until the end of the file or data stream. When all data is processed, our script ends with a call to exit(0).
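
Putting the pieces together, the --extract logic can be sketched as a single method. This is an illustration under stated assumptions rather than the article's script: extract_columns is a hypothetical name, errors are raised as exceptions instead of being printed before exit(1), negative indexes are rejected as well, and the output line is built with join rather than by trimming a trailing comma.

```ruby
require 'stringio'

# Write the selected columns of each CSV line from input to output.
# extract_columns and its exceptions are assumptions for illustration.
def extract_columns(input, output, columns)
  expected = nil
  input.each_line do |raw|
    data = raw.chomp.split(",")
    expected ||= data.length            # the first line sets the baseline
    if data.length != expected
      raise ArgumentError, "field count differs from previous line: #{raw}"
    end
    columns.each do |column|
      unless (0...data.length).cover?(column)
        raise IndexError, "column index out of range, " \
                          "use numbers between 0 and #{data.length - 1}"
      end
    end
    # join inserts the separators, so no trailing comma needs trimming.
    output.puts columns.map { |column| data[column] }.join(",")
    expected = data.length
  end
end

out = StringIO.new
extract_columns(StringIO.new("a,b,c\n1,2,3\n"), out, [2, 0])
```

The call mirrors csvt -e 2,0: each output line holds the third field followed by the first.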

The code used to process STDIN when the user chooses the --remove option is similar to the --extract handler, with a small twist after the line variable initialization.

if remove_f == true
    first_f = true

    while gets
        data   = $_.chomp
        data   = data.split(",")
        data_n = data.length

        if first_f
            old_data_n = data_n
            first_f    = false
        end

        if data_n != old_data_n
            $stderr.print "csvt: the number of fields on the following " +
                          "line does not match the number of fields on " +
                          "the previous line\n"
            $stderr.print $_
            exit(1)
        end

        line = ""

There is an additional loop that sets the columns whose indexes are listed as arguments of --remove to "".

        remove_args.each do |column|

            if !(column < data_n)
                $stderr.print "csvt: field index out of range, " +
                              "use numbers between 0 and ",
                              data_n - 1, "\n"
                exit(1)
            end

            data[column] = ""
        end

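The rest of the code follows the --extract handler, except that it builds the output line from the whole data array and skips the blanked fields. Note that fields that were already empty in the input are skipped as well.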

        data.each do |column|
            if column == ""
                next
            else
                line += column + ","
            end
        end

        print line[0, line.length-1], "\n"

        old_data_n = data_n
    end
end
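
Because the handler above skips every empty string, a field that was empty in the input disappears along with the removed columns. One possible way around this, sketched below, is to filter by position rather than by content. The method name remove_columns is hypothetical, and the integrity and range checks are omitted to keep the example short.

```ruby
require 'stringio'

# Write each CSV line from input to output without the listed columns.
# remove_columns is a hypothetical helper, not the article's code.
def remove_columns(input, output, columns)
  input.each_line do |raw|
    data = raw.chomp.split(",", -1)
    # Keep a field only if its position is not in the removal list,
    # so fields that are empty in the input survive.
    kept = data.each_with_index.reject { |_field, index| columns.include?(index) }
    output.puts kept.map { |field, _index| field }.join(",")
  end
end

out = StringIO.new
remove_columns(StringIO.new("a,,c\n1,2,3\n"), out, [0])
```

Here the empty second field of the first line is preserved, while the first column is dropped from both lines.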

We now have a complete script to help us filter CSV files. It may grow in the future, but for now it is quite complete. Your script plays well with other command-line Unix tools and is a well behaved Unix citizen. The complete script is here.

Make csvt Executable

Your script is working now and you could call it quits, but for greater convenience in the future, try to make an extra effort and make csvt executable, so you can type just this:

$ csvt

instead of this:

$ ruby csvt.rb

If you are using Unix, simply add this code on the first line of your script:

#!/usr/local/bin/ruby

The actual path to the ruby interpreter binary might be different on your system. The easiest way to find out is to use the locate or which command:

$ locate ruby
$ which ruby

If either fails, use find:

$ find / -name "ruby"

This might take a while because find searches the whole directory tree. Once you know the path to the ruby binary, paste it after #! and save the script to disk. Remember that you need to place this line at the very top of your script, or the system will not recognize it as a request to use the Ruby interpreter. If you need to pass options to the interpreter, you can list them after the path, but there is no need to list the name of the script itself.
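
If ruby already runs from your shell, the interpreter can also report its own location. A small sketch using the standard rbconfig library, which prints a ready-made first line for the script:

```ruby
require 'rbconfig'

# RbConfig.ruby returns the full path of the interpreter running this code.
interpreter = RbConfig.ruby

# Print a shebang line that could be pasted at the top of csvt.
puts "#!#{interpreter}"
```

The printed path is only valid for the machine it was produced on, so a script meant for other systems still needs its first line adjusted there.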

Now save csvt to disk, and make it executable with $ chmod u+x csvt.

The u+x argument tells chmod to mark csvt as executable only by the owner of the script (that would be you ...). Other possibilities include g+x, which marks the script as executable by all members of the group that the script is assigned to (ls -l reveals the script's group); o+x, which would make the script executable by all other users (not a good idea); finally, a+x would make it executable by all users (this should be avoided as well).
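
The effect of u+x on the permission bits can be observed from Ruby itself. The sketch below assumes a Unix-like system and uses a temporary file as a stand-in for csvt; it starts from mode 644 (readable by all, writable by the owner) and sets the owner's execute bit, which is what chmod u+x does.

```ruby
require 'tempfile'

# Create a throwaway file standing in for csvt (the file is hypothetical).
script = Tempfile.new('csvt')
script.write("puts 'hello'\n")
script.close

File.chmod(0o644, script.path)            # rw-r--r--: nobody may execute it
before = File.stat(script.path).mode & 0o777

File.chmod(before | 0o100, script.path)   # set the owner's execute bit (u+x)
after = File.stat(script.path).mode & 0o777

puts format('%o -> %o', before, after)
```

In the same notation, g+x corresponds to the 0o010 bit, o+x to 0o001, and a+x to all three execute bits together.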

Note that neither the #! notation nor the chmod command can be used in the Microsoft Windows environment unless you install the Cygwin package, which gives Windows a fairly good Unix look and feel. When installing Cygwin is not an option, you can still use csvt, but it must be preceded with the ruby command, as in ruby csvt -e file instead of csvt -e file.

Resources

The following places should be on the list of favorite destinations for everyone learning and using Ruby:

  • Ruby binaries and sources
  • Ruby mailing lists
  • the Ruby newsgroup
  • the Cygwin Unix environment for Microsoft Windows
  • the Fink Unix environment for Mac OS X (the latest Ruby builds for Mac OS X)

Books

If you want to enhance your knowledge of Ruby, you should take a look at Ruby in a Nutshell from O'Reilly or Programming Ruby from Addison-Wesley. Safari has at least half a dozen Ruby titles, from O'Reilly as well as other publishers.

Jacek Artymiak started his adventure with computers in 1986 with a Sinclair ZX Spectrum. He's been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.




Have a question about Ruby or the Unix CLI philosophy? Ask Jacek here.

  • Ruby CSV class
    2006-10-13 16:27:54  shadowbq

    4/10 Agree
    You could have done a lot better by using standard libraries such as csv.rb.
    Maybe do an array of hashes that can use sort_by?


    require 'csv'
    CSV.open('csvfile.csv', 'r') do |row|
    p row
    end


    And the IO class for reading.. come on.

    docs:
    http://www.ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
    http://www.ruby-doc.org/core/classes/IO.html
  • from String to Integer
    2004-07-13 03:41:08  Mailinator

    On the second page of the article, the author converts the elements of the array named extract_args (or remove_args) from Strings to Integers. You can do it in a Rubier way with the Array#collect method, like:

    when '--extract'
    extract_f = true
    begin
    extract_args = arg.split(',').collect { |column| column.to_i }
    rescue
    # ...

    And you can do the same with the when '--remove' clause of the switch.
  • Why executable only to yourself?
    2003-09-27 10:29:13  anonymous2

    In this program there's nothing sensitive happening, so I don't see why the code is marked executable to yourself only.

    As a matter of sharing tools it seems like a good idea, to me, to place scripts like this in /usr/local/bin so that all users may share them.

    Unless there are hardcoded passwords or things that you wish to keep private, I fail to see why making scripts 755 is bad in general, which the author seems to suggest.

    (As for the non-idiomatic Ruby code, I'm not going to comment on that, as others already have. It's good to see code like this to show newcomers that Ruby can be written however you like).
  • don't forget "puts"!
    2003-09-21 04:21:21  dblack

    Don't forget the "puts" method. In Ruby, you don't have to do all that

    print str, "\n"

    or

    print "string\n"

    stuff. Using puts is shorter and cleaner, and will help deflect some of the criticisms about writing Ruby in a Perl style :-)

  • "It's code like this that causes unrest."
    2003-09-21 02:12:23  anonymous2

    I have to agree with the other comments that the code is very un-Ruby-like. Seems like someone who's only recently come from Perl and is trying to write Ruby in a Perl style.

    "while gets {}" for example. What nonsense. "$stdin.readlines.each {}" is much more Rubyish.

    The reinvention of the command-line and CSV parsing wheels is also something that tends to afflict Perl people, not to mention the fact that the CSV parsing is inadequate and will fail with embedded commas.

    Overall, 4/10 for effort.

    • "more Rubyish" isn't relevant in this case
      2003-09-21 05:13:39  dblack

      > "while gets {}" for example. What nonsense.
      > "$stdin.readlines.each {}" is much more Rubyish.

      These two things are not equivalent. gets returns one line. readlines reads all the lines into an array. There may be cases when these two techniques appear to do the same thing, but in fact they don't. If you try them with keyboard input, you'll see what I mean.

      Since they are different, determining which one is "more Rubyish" is secondary to deciding which one you actually want to do. Moreover, if one needs the gets version, one can then do:

      while (line = gets)

      to get rid of $_. (You'll find this technique used in several places in the Ruby distribution.)


      • "more Rubyish" isn't relevant in this case
        2003-09-23 08:55:57  anonymous2

        I think $stdin.each_line {} is better and more "rubyish" than any of the above.
  • very un-Rubyish
    2003-09-21 02:01:16  anonymous2

    $_? I'm relatively new to Ruby, and the article IS nice; sadly, the code isn't. The $ variables are rarely used, and I haven't seen the begin keyword that often. OTOH, the OptionParser library would have been more appropriate IMHO. Besides that, it is not working on my win2k box!! (Guess I'll have to make my own.) My conclusion: nice try; the code is Ruby, but not the Ruby *way*. Recommended tab size in Ruby = 2 spaces.
  • O'Reilly technical review?
    2003-09-20 21:48:16  anonymous2

    Did O'Reilly consult with any Ruby experts before publishing this article, to make sure they weren't publishing crap? Here's an opinion from comp.lang.ruby:

    I haven't read it thoroughly, but I noticed that most of the code was
    horribly un-Rubyish. It could be made a lot nicer. If I had time,
    I'd rewrite it all and submit it to the author!
  • Shebang line
    2003-09-20 10:14:47  anonymous2

    Hint: Instead of hard-coding the ruby path, use

    #!/usr/bin/env ruby

    The ruby binary will be searched automatically then.
  • good but..
    2003-09-19 16:55:01  anonymous2

    This is a nice article!
    Anyway, as a real newbie to Ruby, I think you're coding too much in Perl style.
    I think most of the Ruby gurus would avoid $_, using an explicit variable instead.
    Anyway, interesting stuff :)
  • Option parsing
    2003-09-19 11:42:31  anonymous2

    You spend an awful lot of time talking about parsing options here. Why not just use Ruby 1.8's optparse library? Should shrink your code dramatically.

