Retrieving Webpages Using wget, curl and lynx
Whether you are an IT professional who needs to download 2000 online bug
reports into a flat text file and parse them to see which ones need
attention, or a mum who wants to download 20 recipes from a public domain
website, you can benefit from knowing the tools which help you download
webpages into a text based file. If you are interested in learning more
about how to parse the pages you download, have a look at our Big Data
Manipulation for Fun and Profit Part 1 article.
In this tutorial you will learn:
* How to retrieve/download webpages using wget, curl and lynx
* What the main differences between the wget, curl and lynx tools are
* Examples showing how to use wget, curl and lynx
Software requirements and conventions used

Software Requirements and Linux Command Line Conventions
Category        Requirements, Conventions or Software Version Used
System          Linux Distribution-independent
Software        Bash command line, Linux based system
Other           Any utility which is not included in the Bash shell by default
                can be installed using sudo apt-get install utility-name (or
                yum install for RedHat based systems)
Conventions     # – requires linux-commands to be executed with root
                privileges, either directly as a root user or by use of the
                sudo command
                $ – requires linux-commands to be executed as a regular
                non-privileged user
Before we start, please install the 3 utilities using the following
command (on Ubuntu or Mint), or use yum install instead of apt-get install
if you are using a RedHat based Linux distribution.
$ sudo apt-get install wget curl lynx
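If you would like to confirm that all three tools are available before continuing, one quick optional check is to ask each for its version (the exact version strings will of course differ per system):
$ wget --version | head -n1
$ curl --version | head -n1
$ lynx --version | head -n1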
Once done, let’s get started!
Example 1: wget
Using wget to retrieve a page is easy and straightforward:
$ wget https://linuxconfig.org/linux-complex-bash-one-liner-examples
--2020-10-03 15:30:12-- https://linuxconfig.org/linux-complex-bash-one-liner-examples
Resolving linuxconfig.org (linuxconfig.org)... 2606:4700:20::681a:20d, 2606:4700:20::681a:30d, 2606:4700:20::ac43:4b67, ...
Connecting to linuxconfig.org (linuxconfig.org)|2606:4700:20::681a:20d|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'linux-complex-bash-one-liner-examples'
linux-complex-bash-one-liner-examples [ <=> ] 51.98K --.-KB/s in 0.005s
2020-10-03 15:30:12 (9.90 MB/s) - 'linux-complex-bash-one-liner-examples' saved [53229]
$
Here we downloaded an article from linuxconfig.org into a file, which by
default is named after the last part of the URL.
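If you prefer a different filename, wget also accepts the -O option to set the output file yourself; the filename below is just an illustrative choice:
$ wget -O bash-one-liners.html https://linuxconfig.org/linux-complex-bash-one-liner-examples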
Let’s check out the file contents:
$ file linux-complex-bash-one-liner-examples
linux-complex-bash-one-liner-examples: HTML document, ASCII text, with very long lines, with CRLF, CR, LF line terminators
$ head -n5 linux-complex-bash-one-liner-examples
```html
<!DOCTYPE html>
<html lang="en-gb" dir="ltr">
<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
```
Example 2: curl
$ curl https://linuxconfig.org/linux-complex-bash-one-liner-examples > linux-complex-bash-one-liner-examples
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 53045    0 53045    0     0  84601      0 --:--:-- --:--:-- --:--:-- 84466
$
This time we used curl to do the same as in our first example. By default, curl writes to standard output (stdout) and would display the full HTML page in your terminal, so we instead redirect it (using >) into the file linux-complex-bash-one-liner-examples.
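As a side note, curl can also write straight to a file without shell redirection: its -o option names the output file, and adding -s silences the progress meter. For our purposes the command below behaves the same as the redirect above:
$ curl -s -o linux-complex-bash-one-liner-examples https://linuxconfig.org/linux-complex-bash-one-liner-examples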
We again confirm the contents:
$ file linux-complex-bash-one-liner-examples
linux-complex-bash-one-liner-examples: HTML document, ASCII text, with very long lines, with CRLF, CR, LF line terminators
$ head -n5 linux-complex-bash-one-liner-examples
```html
<!DOCTYPE html>
<html lang="en-gb" dir="ltr">
<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
```
Great, the same result!
One challenge, when we want to process this file further, is that its format is HTML based. We could parse the output with sed or awk and some semi-complex regular expressions to reduce it to text only, but doing so is somewhat complex and often not sufficiently robust. Instead, let’s use a tool which was natively designed to dump pages in text format.
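For illustration only, a naive tag-stripping attempt could look like the command below; it strips anything between < and > on a line, but it leaves the contents of <script> and <style> blocks behind, does not decode entities such as &amp;, and misses tags that span multiple lines, which is exactly why a purpose-built tool is preferable:
$ sed 's/<[^>]*>//g' linux-complex-bash-one-liner-examples | head -n5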
Example 3: lynx
Lynx is another tool which we can use to retrieve the same page. However, unlike wget and curl, lynx is meant to be a full (text-based) browser. Thus, the output from lynx will be text based rather than HTML. We can use the lynx -dump command to output the webpage being accessed, instead of starting a fully interactive (text-based) browser in your Linux client.
$ lynx -dump https://linuxconfig.org/linux-complex-bash-one-liner-examples > linux-complex-bash-one-liner-examples
$
Let’s examine the contents of the created file once more:
$ file linux-complex-bash-one-liner-examples
linux-complex-bash-one-liner-examples: UTF-8 Unicode text
$ head -n5 linux-complex-bash-one-liner-examples
   * [1]Ubuntu
     + o [2]Back
       o [3]Ubuntu 20.04
       o [4]Ubuntu 18.04
As you can see, this time we have a UTF-8 Unicode text based file, unlike the previous wget and curl examples, and the head command confirms that the first 5 lines are text based (with references to the URLs in the form of [nr] markers). We can see the URLs towards the end of the file:
$ tail -n86 linux-complex-bash-one-liner-examples | head -n3
Visible links
 1. https://linuxconfig.org/ubuntu
 2. https://linuxconfig.org/linux-complex-bash-one-liner-examples
Retrieving pages in this way provides us with the great benefit of having HTML-free, text-based files which we can process further if required.
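Should you not need the [nr] markers and the trailing link list for further processing, lynx also has a -nolist option which suppresses them; this is an optional variation rather than something the examples above require:
$ lynx -dump -nolist https://linuxconfig.org/linux-complex-bash-one-liner-examples > linux-complex-bash-one-liner-examples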