Extracting Link Information From Text

Introduction

In this project, you will learn how to extract link information from Markdown documents using a Bash script. This is a common text-processing task: pulling specific pieces of data, here the text and URL of each link, out of a plain-text document.

👀 Preview

$ ./getlink.sh labex_lab1.md
course https://labex.io/courses/

🎯 Tasks

In this project, you will learn:

  • How to create a Bash script to extract link text and URLs from a Markdown document
  • How to use regular expressions and command-line tools like grep and paste to process text data
  • How to make a script executable and run it with command-line arguments

🏆 Achievements

After completing this project, you will be able to:

  • Develop a Bash script that can extract link information from Markdown documents
  • Understand the logic and implementation of the script, including the use of regular expressions and common command-line tools
  • Apply the skills learned in this project to other text processing tasks in your software development work

Create the getlink.sh Script

In this step, you will create the getlink.sh script that can extract all the links from a Markdown document.

  1. Open a text editor and create a new file named getlink.sh.
  2. Add the following code to the file:
#!/bin/bash

## Extract link text into links.txt and URLs into urls.txt, skipping lines that contain images
grep -E "\[.*\]\(.+\)" "$1" | grep -vP '\!\[' | grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))' > "links.txt"
grep -E "\[.*\]\(.+\)" "$1" | grep -vP '\!\[' | grep -oP '\]\(\K[^\)]+(?=\))' > "urls.txt"

## Merge links and URLs
paste -d ' ' links.txt urls.txt

## Clean up temporary files
rm links.txt urls.txt
  3. Save the file.
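
The task list mentions making the script executable, and the test step runs it as ./getlink.sh, which only works once the execute bit is set. The sketch below shows the idea on a throwaway file; demo.sh and the scratch directory are hypothetical stand-ins, not part of the project.

```shell
#!/bin/bash
# Work in a scratch directory so nothing in your project is touched.
scratch=$(mktemp -d)
script="$scratch/demo.sh"   # hypothetical stand-in for getlink.sh

# Create a tiny script; a newly created file normally has no execute bit.
printf '#!/bin/bash\necho "argument: $1"\n' > "$script"
[ -x "$script" ] || echo "not executable yet"

# Grant execute permission, then confirm it with the -x test.
chmod +x "$script"
[ -x "$script" ] && echo "now executable"
```

For your own script, the equivalent command is chmod +x getlink.sh, run in the directory where you saved the file.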

Test the getlink.sh Script

In this step, you will test the getlink.sh script by running it with a Markdown file as an argument.

  1. In the same directory as the getlink.sh script there is a Markdown file named labex_lab1.md. This file contains the following:
Use the course categories and tags on the [course](https://labex.io/courses/) page to filter and search for courses
  2. Run the getlink.sh script with the labex_lab1.md file as an argument:
./getlink.sh labex_lab1.md
  3. The script should output the following:
course https://labex.io/courses/

This output shows that the script has successfully extracted the link information from the Markdown file.
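
If you are working outside the lab environment, labex_lab1.md may not exist yet. As a minimal sketch, assuming you want the same one-line sample shown above, you can recreate it with a here-document and check that it contains a line shaped like a Markdown link:

```shell
#!/bin/bash
# Recreate the sample file in the current directory.
# Note: this overwrites any existing labex_lab1.md.
# The quoted 'EOF' delimiter keeps the text literal (no variable expansion).
cat > labex_lab1.md <<'EOF'
Use the course categories and tags on the [course](https://labex.io/courses/) page to filter and search for courses
EOF

# Count the lines that contain something shaped like [text](url); prints 1 here.
grep -cE '\[.*\]\(.+\)' labex_lab1.md
```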

Understand the getlink.sh Script

In this step, you will walk through the code in the getlink.sh script to see how it works.

The script performs the following tasks:

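  1. Extract link text: The first pipeline extracts the link text from the Markdown file and saves it to a temporary file named links.txt. grep -E "\[.*\]\(.+\)" keeps only the lines that contain the Markdown link format [text](url). grep -vP '\!\[' then drops every line that contains image syntax such as ![alt](src); because it removes the whole line, an ordinary link that shares a line with an image is skipped too. Finally, grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))' prints only the text between the square brackets: \K discards the opening [ from the match, and the lookahead (?=...) requires ](url) to follow without including it in the output.
  2. Extract link URLs: The second pipeline applies the same two filters and then uses grep -oP '\]\(\K[^\)]+(?=\))' to print only the URL: \K discards the ]( prefix, and the lookahead requires the closing ) without including it. The results are saved to a temporary file named urls.txt.
  3. Merge link text and URLs: The paste -d ' ' links.txt urls.txt command joins the two temporary files line by line, so that line N of links.txt and line N of urls.txt are printed together, separated by a space.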
  4. Clean up temporary files: The rm links.txt urls.txt command removes the temporary files created during the script's execution.
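
To see what each grep stage contributes, it can help to run the stages one at a time on the sample line from labex_lab1.md. This is a sketch for experimenting, not a replacement for the script; the variable names line, link_text, and link_url are illustrative, and it assumes GNU grep built with -P (PCRE) support, as the script itself does.

```shell
#!/bin/bash
# The sample line from labex_lab1.md
line='Use the course categories and tags on the [course](https://labex.io/courses/) page to filter and search for courses'

# Stage 1: keep lines that contain something shaped like [text](url).
printf '%s\n' "$line" | grep -E '\[.*\]\(.+\)'

# Stage 2: print only the link text.
# \K drops the "[" already matched; the lookahead requires "](url)" to follow.
link_text=$(printf '%s\n' "$line" | grep -oP '\[\K[^\]]+(?=\]\([^\)]+\))')

# Stage 3: print only the URL.
# \K drops the "](" prefix; the lookahead requires the closing ")".
link_url=$(printf '%s\n' "$line" | grep -oP '\]\(\K[^\)]+(?=\))')

echo "$link_text $link_url"
```

The last line prints course https://labex.io/courses/, the same result the full script produces for this file.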

By understanding the script's logic, you can modify or extend it to suit your specific needs, such as handling different types of links or performing additional processing on the extracted information.
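
As one possible extension, the sketch below avoids temporary files and keeps ordinary links that share a line with an image. It is a hypothetical variant, not the project's reference solution: the function name extract_links and the example.com URLs are made up for illustration, and it assumes GNU grep with -P support plus a sed that accepts -E. Each match is a complete [text](url), so the text and URL always come from the same link.

```shell
#!/bin/bash
# Hypothetical variant of getlink.sh; extract_links is an illustrative name.
extract_links() {
  # (?<!!) is a negative lookbehind: the "[" must not be preceded by "!",
  # so image syntax ![alt](src) is skipped without discarding the whole line.
  # sed then rewrites each complete [text](url) match as "text url".
  grep -oP '(?<!!)\[[^\]]+\]\([^)]+\)' "$1" |
    sed -E 's/^\[([^]]+)\]\((.*)\)$/\1 \2/'
}

# Made-up sample input: an image and a link on one line, two links on another.
sample=$(mktemp)
cat > "$sample" <<'EOF'
See ![logo](logo.png) and the [course](https://labex.io/courses/) page.
Read the [docs](https://example.com/docs) or the [faq](https://example.com/faq).
EOF

result=$(extract_links "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

For this input the sketch prints three lines: the course, docs, and faq links, each followed by its URL. It is still a line-oriented approximation and does not handle nested brackets or URLs that contain parentheses.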

Summary

Congratulations! You have completed this project. You can practice more labs in LabEx to improve your skills.
