Getting Page Titles for URLs with Ruby

Published April 21, 2022

If you have a URL and you’d like to get the title for the page, you’ll need to fetch the URL, parse the source, and grab the text from the <title> tag. If you have a large number of URLs, you’ll want to automate this process.

You can do this with a few lines of Ruby and the Nokogiri library. In this tutorial you’ll create a small Ruby program to fetch the title for a web page. Then you’ll modify the program to work with an external file of URLs. Finally, you’ll make it more performant by using threads.

Fetching the Title from the URL

The Nokogiri library lets you use CSS selector syntax to grab nodes from HTML, and the open-uri library lets you read the contents of a URL as if it were a file, much like the cURL command does.

The open-uri library ships with Ruby, so you only need to install the nokogiri gem:

gem install nokogiri

Create a Ruby script called get_titles.rb and add the following code to load the libraries, open a URL as a file, send its contents to Nokogiri, and extract the value of the <title> tag:

require 'nokogiri'
require 'open-uri'

url = "https://google.com"
URI.open(url) do |f|
  doc = Nokogiri::HTML(f)
  title = doc.at_css('title').text
  puts title
end

Save the file and run the program:

ruby get_titles.rb

The result shows the page title for Google:

Google

To do this for multiple URLs, put the URLs in an array manually, or get them from a file.

Reading URLs from a File

You may already have the list of URLs in a file, which may have come from a data export. Using Ruby’s File.readlines, you can quickly convert the file into an array.
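As a rough illustration of what File.readlines returns, the sketch below writes two sample URLs to a temporary file (the URLs and the temporary file are assumptions for the example) and reads them back. It also compares String#chop with the chomp: true option, since the two behave differently when the last line has no trailing line break:

```ruby
require 'tempfile'

# Hypothetical sample data; the last line deliberately has no trailing newline.
file = Tempfile.new('links')
file.write("https://example.com\nhttps://example.org")
file.close

# Without options, each element keeps its line ending.
raw = File.readlines(file.path)

# chop always removes the last character, even when it is not a newline.
chopped = raw.map(&:chop)

# chomp: true strips only the line ending as the file is read.
clean = File.readlines(file.path, chomp: true)

p raw
p chopped
p clean

file.unlink
```

Here chop damages the second URL because there is no newline to remove, while chomp: true leaves it intact.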

Create a new file called links.txt and add a couple of links. Make one of them a bad URL so you can test the program’s error handling.

https://google.com
https://devto

Save the file.

Now return to your get_titles.rb file and modify the code so it reads the file in line-by-line, and uses each line as a URL:

# get_titles.rb
require 'nokogiri'
require 'open-uri'

lines = File.readlines('links.txt')
lines.each do |line|
  url = line.chomp
  URI.open(url) do |f|
    doc = Nokogiri::HTML(f)
    title = doc.at_css('title').text
    puts title
  end
rescue SocketError
  puts "#{url}: can't connect. Bad URL?"
end

Each line from the file will have a line break at the end, which you remove with the .chomp method before storing the value in the url variable. Unlike .chop, which always removes the last character, .chomp removes only a trailing line break, so a final line with no line break keeps its full URL.

The URI.open method raises a SocketError if it can’t connect, so you rescue that error and print a helpful message.

Save the file and run the program again:

ruby get_titles.rb

This time you see Google’s page title for the first URL, and the error message for the second:

Google
https://devto: can't connect. Bad URL?

This version slows down as your list grows. On a file with 200 URLs, the process took 2 minutes. Most of that time was network latency: the program makes one request at a time and waits for each response before starting the next.

Let’s make it faster.

Processing URLs Concurrently

To make this process much faster, you’ll need to use threads so that several requests are in flight at once. And if you use threads, you’ll need to think about thread pooling, because if you use too many threads you’ll run out of resources.
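To give a sense of what thread pooling involves when done by hand, here is a minimal sketch that uses only Ruby’s built-in Thread and Queue classes. The slow_lookup method is a hypothetical stand-in for a network request, and the pool size of 4 is an arbitrary assumption:

```ruby
# Hypothetical stand-in for a slow network request.
def slow_lookup(url)
  sleep 0.05
  "title for #{url}"
end

urls = (1..8).map { |i| "https://example.com/#{i}" }
pool_size = 4

# Load the work queue, then close it so pop returns nil once it is empty.
queue = Queue.new
urls.each { |url| queue << url }
queue.close

# A fixed number of workers pull URLs until the queue runs dry.
results = Queue.new
workers = Array.new(pool_size) do
  Thread.new do
    while (url = queue.pop)
      results << [url, slow_lookup(url)]
    end
  end
end
workers.each(&:join)

# Drain the results queue into a hash of url => title.
titles = Array.new(results.size) { results.pop }.to_h
puts titles.size
```

The fixed number of workers caps how many requests run at the same time, no matter how long the list of URLs is.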

The concurrent-ruby gem makes this much less complex by giving you promises in Ruby, which have their own pooling mechanism.

Install the concurrent-ruby gem:

gem install concurrent-ruby

To use it, you’ll create a “job” for each line in the file. Each job is a promise which takes a block containing the code you want to execute. Then, following the loop, you collect all of the promises and call the value method, which blocks until the promise is complete. The pattern looks like this:

# Create a job for each line
jobs = lines.map do |line|
  Concurrent::Promises.future do
    # do the work
  end
end

# Wait for all the jobs, blocking until they all finish.
Concurrent::Promises.zip(*jobs).value

Modify the program to include the concurrent library and create a promise for each URL read. Then get the results:

# get_titles.rb
require 'nokogiri'
require 'open-uri'
require 'concurrent'

lines = File.readlines('links.txt')
jobs = lines.map do |line|
  Concurrent::Promises.future do
    url = line.chomp

    URI.open(url) do |f|
      doc = Nokogiri::HTML(f)
      title = doc.at_css('title').text
      puts title
    end
  rescue SocketError
    puts "#{url}: can't connect. Bad URL?"
  end
end

Concurrent::Promises.zip(*jobs).value

In this version of the program, each job prints its result to the screen as soon as it finishes. You could instead return a value from each job, collect the values, and print them once all the jobs are done.

This time, a file with 200 URLs took around 3 seconds to process. That’s a significant speed improvement and demonstrates why concurrent processing is important for these kinds of tasks.

Conclusion

In this tutorial you used Ruby to get page titles from URLs, and you then optimized it using the concurrent-ruby library to take advantage of threads and thread pooling.

To keep exploring, read in the data from a CSV file and use the program to generate a new CSV file with the URL and the title in separate columns.

Then see if you can pull additional information out of the URLs, such as the <meta> descriptions.

