Extracting Link Information From Text

Introduction

In this project, you will learn how to extract link information from Markdown documents using a Bash script. Pulling specific pieces of information, such as link text and URLs, out of text-based documents is a common task in software development.

👀 Preview

$ ./getlink.sh labex_lab1.md
course https://labex.io/courses/

🎯 Tasks

In this project, you will learn:

  • How to create a Bash script to extract link text and URLs from a Markdown document
  • How to use regular expressions and command-line tools like grep and paste to process text data
  • How to make a script executable and run it with command-line arguments

🏆 Achievements

After completing this project, you will be able to:

  • Develop a Bash script that can extract link information from Markdown documents
  • Understand the logic and implementation of the script, including the use of regular expressions and common command-line tools
  • Apply the skills learned in this project to other text processing tasks in your software development work

Create the getlink.sh Script

In this step, you will create the getlink.sh script that can extract all the links from a Markdown document.

  1. Open a text editor and create a new file named getlink.sh.
  2. Add the following code to the file:
#!/bin/bash

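## Extract link text: keep lines containing [text](url), drop lines containing an image (![),
## then print only the text between the square brackets
grep -E "\[.*\]\(.+\)" "$1" | grep -vP '\!\[' | grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))' > "links.txt"
## Extract URLs: same two filters, then print only the part between the parentheses
grep -E "\[.*\]\(.+\)" "$1" | grep -vP '\!\[' | grep -oP '\]\(\K[^\)]+(?=\))' > "urls.txt"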

## Merge links and URLs
paste -d ' ' links.txt urls.txt

## Clean up temporary files
rm links.txt urls.txt
  3. Save the file, then make it executable by running chmod +x getlink.sh in the same directory.

Test the getlink.sh Script

In this step, you will test the getlink.sh script by running it with a Markdown file as an argument.

  1. In the same directory as the getlink.sh script there is a Markdown file named labex_lab1.md. This file contains the following:
Use the course categories and tags on the [course](https://labex.io/courses/) page to filter and search for courses
  2. Run the getlink.sh script with the labex_lab1.md file as an argument:
./getlink.sh labex_lab1.md
  3. The script should output the following:
course https://labex.io/courses/

This output shows that the script has successfully extracted the link information from the Markdown file.

Understand the getlink.sh Script

In this step, you will walk through the code in the getlink.sh script to see how it works.

The script performs the following tasks:

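  1. Extract link text: The first pipeline extracts the link text from the Markdown file and saves it to a temporary file named links.txt. The grep -E "\[.*\]\(.+\)" command keeps only the lines that contain the Markdown link format [text](url). The grep -vP '\!\[' command then drops every line that contains an image link, including any regular links on that same line. Finally, grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))' prints only the text between the square brackets: \K discards the already-matched [ from the output, and the lookahead (?=...) requires a following ](url) without including it.
  2. Extract link URLs: The second pipeline applies the same two filters and saves its result to a temporary file named urls.txt. The grep -oP '\]\(\K[^\)]+(?=\))' command prints only the URL between the parentheses of each Markdown link.
  3. Merge link text and URLs: The paste -d ' ' links.txt urls.txt command joins the two temporary files line by line, so that each link text is printed next to its URL, separated by a space.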
  4. Clean up temporary files: The rm links.txt urls.txt command removes the temporary files created during the script's execution.
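
To see what each stage contributes, the pipeline can be run one step at a time. The following is a minimal sketch, not part of the original script: the temporary sample file, the variable names (sample, candidates, link_text, link_url), and the image line are assumptions made for illustration. It assumes GNU grep with -P support, as the script itself does.

```shell
#!/bin/bash

# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' \
  'Use the course categories and tags on the [course](https://labex.io/courses/) page to filter and search for courses' \
  '![logo](logo.png)' > "$sample"

# Stage 1: keep lines that contain [text](url), then drop lines that contain "![".
candidates=$(grep -E "\[.*\]\(.+\)" "$sample" | grep -vP '\!\[')

# Stage 2: \K drops the "[" from the printed match; the lookahead checks for "](url)".
link_text=$(printf '%s\n' "$candidates" | grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))')

# Stage 3: \K drops the "](" from the printed match; the lookahead checks for ")".
link_url=$(printf '%s\n' "$candidates" | grep -oP '\]\(\K[^\)]+(?=\))')

echo "text: $link_text"
echo "url: $link_url"

rm -f "$sample"
```

For this sample, the sketch should print text: course and url: https://labex.io/courses/. The image line is removed in stage 1, so it never reaches the extraction stages.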

By understanding the script's logic, you can modify or extend it to suit your specific needs, such as handling different types of links or performing additional processing on the extracted information.
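
As one possible modification, note that the script writes links.txt and urls.txt into the current directory, so it would overwrite and then delete any existing files with those names. The sketch below is an assumption-laden variant, not the project's reference solution: it wraps the same grep commands in a function (extract_links, a hypothetical name) and keeps the intermediate files in a private directory created with mktemp -d.

```shell
#!/bin/bash

# extract_links is a hypothetical helper name, not part of the original script.
# The function body runs in a subshell, so tmpdir does not leak into the caller.
extract_links() (
  tmpdir=$(mktemp -d) || exit 1

  # Same filters as getlink.sh: keep [text](url) lines, drop lines containing "![".
  grep -E "\[.*\]\(.+\)" "$1" | grep -vP '\!\[' > "$tmpdir/lines"

  # Same extraction patterns as getlink.sh, writing into the private directory.
  grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))' "$tmpdir/lines" > "$tmpdir/texts"
  grep -oP '\]\(\K[^\)]+(?=\))' "$tmpdir/lines" > "$tmpdir/urls"

  # Pair each link text with its URL, separated by a space.
  paste -d ' ' "$tmpdir/texts" "$tmpdir/urls"

  rm -rf "$tmpdir"
)

# Run against the file given as the first argument, if there is one.
if [ -n "$1" ]; then
  extract_links "$1"
fi
```

Saved under a name of your choice, it would be invoked the same way as the original, for example with labex_lab1.md as the argument. The extraction logic is unchanged, so lines that contain an image are still skipped entirely.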

Summary

Congratulations! You have completed this project. You can practice more labs in LabEx to improve your skills.
