Get Text Between Two Strings in Bash

1. Overview

In scripting and programming, extracting specific portions of text from larger strings or files is a common task. In this article, we will see different ways to get the text between two strings using grep, sed, awk, and Bash parameter expansion.

2. Introduction to Problem Statement

We are given a string, and we need to find the text between two given marker strings in Bash.
For example:

Input String: Start text[Extract this]End text
Output String: [Extract this]

Our goal is to find text between Start text and End text.

3. Using grep Command with -oP option

The grep command searches a file or input stream for text that matches a pattern. The -o and -P options make it well suited to this task.

The -o option tells grep to output only the matched portion of a line, and the -P option enables Perl-compatible regular expressions (PCRE) for pattern matching. The -P option is specific to GNU grep.
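The following is a minimal sketch of the command this section describes, assuming GNU grep (for -P) and the sample input from the problem statement. The variable names string and extracted_text are illustrative.

```shell
# Sample input from the problem statement
string="Start text[Extract this]End text"

# -o prints only the matched part; -P enables PCRE (lookbehind/lookahead)
extracted_text=$(echo "$string" | grep -oP '(?<=Start text).*?(?=End text)')

echo "$extracted_text"   # prints: [Extract this]
```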

Now let’s understand the regular expression (?<=Start text).*?(?=End text):

  • (?<=Start text): This is a positive lookbehind assertion. It matches a position in the string preceded by the literal string "Start text", but it does not include "Start text" in the match.
  • .*?: This pattern matches any character (except a newline) zero or more times lazily. The ? makes the * quantifier lazy, matching as little text as possible.
  • (?=End text): This is a positive lookahead assertion. It matches a position in the string followed by the literal string "End text", but it does not include "End text" in the match.

In other words, this regular expression matches the text that comes after "Start text" and before "End text" in the input string. The command substitution syntax $() captures the command’s output and assigns it to the extracted_text variable, which is then printed on the screen using the echo command.
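To see what the lazy quantifier changes, here is a small sketch comparing .*? with the greedy .* on a line that contains "End text" twice. The input line is a made-up example, and GNU grep is assumed.

```shell
# Hypothetical input with two "End text" markers
line="Start text[first]End text and [second]End text"

# Lazy: stops at the first "End text"
lazy=$(echo "$line" | grep -oP '(?<=Start text).*?(?=End text)')

# Greedy: runs up to the last "End text"
greedy=$(echo "$line" | grep -oP '(?<=Start text).*(?=End text)')

echo "$lazy"     # prints: [first]
echo "$greedy"   # prints: [first]End text and [second]
```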

4. Using sed Command

sed (stream editor) is a versatile text processing tool that performs text transformations on an input stream (a file or input from a pipeline).

Let’s use sed to get the text between two strings.
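A minimal sketch of the sed approach follows. Note one assumption: the pattern contains a space after "Start text" and before "End text", so the sample input here includes those spaces. The variable names are illustrative.

```shell
# The pattern below expects a space on each side of the wanted text
string="Start text [Extract this] End text"

# -n suppresses automatic printing; the p flag prints the substituted line
extracted_text=$(echo "$string" | sed -n 's/Start text \(.*\) End text/\1/p')

echo "$extracted_text"   # prints: [Extract this]
```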

Here, the -n option suppresses sed’s default behavior of printing every input line, so only the lines explicitly printed with the p flag appear in the output.

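Now let’s break down 's/Start text \(.*\) End text/\1/p' to understand what it means:

  • s: The substitution command.
  • Start text : Matches the literal string "Start text " (including the trailing space).
  • \(.*\): The escaped parentheses capture the text between the two markers and save it in a group. Because .* is greedy, the group extends to the last " End text" on the line.
  •  End text: Matches the literal string " End text" (including the leading space).
  • /\1/: This is the replacement part of the sed command. The \1 refers to the first captured group, so the matched portion of the line is replaced with the captured text.
  • p: Prints the line if a substitution was made.

Because the pattern contains a space after "Start text" and before "End text", the input must contain those spaces, for example "Start text [Extract this] End text". If the markers touch the wanted text directly, remove the spaces from the pattern.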

Simply put, sed uses the substitution command s to match the given pattern and replaces the match with the text captured in the group, referenced by the \1 backreference. Finally, the resulting line is printed because of the p flag.
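The effect of -n together with the p flag is easier to see on input that also contains a non-matching line. This is a sketch with a hypothetical two-line input; the second line has spaces around the bracketed text, as the pattern requires.

```shell
# With -n and p: only the line where the substitution succeeded is printed
with_n=$(printf '%s\n' 'no markers here' 'Start text [Extract this] End text' \
    | sed -n 's/Start text \(.*\) End text/\1/p')

# Without -n and p: every line is printed, changed or not
without_n=$(printf '%s\n' 'no markers here' 'Start text [Extract this] End text' \
    | sed 's/Start text \(.*\) End text/\1/')

echo "$with_n"      # prints one line: [Extract this]
echo "$without_n"   # prints two lines: "no markers here", then "[Extract this]"
```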

5. Using awk Command

awk is a scripting language for text processing and is typically used for data extraction. The idea is to use awk with "Start text" and "End text" as field separators and print the second field using {print $2}.
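A minimal sketch of this idea, using the sample input from the problem statement. The separator is written as the regular expression 'Start text|End text', which is an assumption based on the description in this section; the variable names are illustrative.

```shell
string="Start text[Extract this]End text"

# -F takes a regular expression: split on either "Start text" or "End text"
extracted_text=$(echo "$string" | awk -F 'Start text|End text' '{print $2}')

echo "$extracted_text"   # prints: [Extract this]
```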

Here, -F is used to specify the field separator for awk. Each input line is divided into separate fields based on the occurrences of either "Start text" or "End text" as separators.

For the input string "Start text[Extract this]End text", the fields would be:

  • Field 1: ""
  • Field 2: "[Extract this]"
  • Field 3: ""

Note that the field separator pattern does not include the actual separators as part of the fields. It is used to define the boundaries for field splitting.

After that, {print $2} prints field 2, which is the required text between the two strings.
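To check the field layout described above, the following sketch prints the number of fields (NF) and each field in quotes. The separator expression is the same assumed 'Start text|End text' pattern.

```shell
string="Start text[Extract this]End text"

# Print the field count, then each field wrapped in double quotes
fields=$(echo "$string" | awk -F 'Start text|End text' '{
    printf "NF=%d\n", NF
    for (i = 1; i <= NF; i++) printf "Field %d: \"%s\"\n", i, $i
}')

echo "$fields"
# NF=3
# Field 1: ""
# Field 2: "[Extract this]"
# Field 3: ""
```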

6. Using Bash Parameter Expansion

Bash parameter expansion offers string manipulation capabilities directly in the shell without calling external commands, which can be efficient for simple operations.

Let’s use Bash parameter expansion to achieve our goal.
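A minimal sketch of the two expansions discussed in this section. Both patterns contain a space next to the marker, so the sample input here is assumed to have spaces around the wanted text.

```shell
string="Start text [Extract this] End text"

# Remove everything up to and including "Start text "
extracted_text=${string#*Start text }

# Remove " End text" and everything after it
extracted_text=${extracted_text%% End text*}

echo "$extracted_text"   # prints: [Extract this]
```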

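The ${string#*Start text } expansion removes the shortest leading portion of the string that ends with "Start text " (including the trailing space). Then, ${extracted_text%% End text*} removes the longest trailing portion that begins with " End text" (including the leading space). Since both patterns contain a space, the input must have spaces between the markers and the wanted text. After that, the echo command displays the required text between the two strings.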
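The choice between # and ##, and between % and %%, matters when the markers occur more than once. The sketch below uses a hypothetical input with two marked sections to show how the operators pick the first or the last one.

```shell
# Hypothetical input with two marked sections
string="Start text [a] End text Start text [b] End text"

# Shortest prefix (#) plus longest suffix (%%): keeps the first section
first=${string#*Start text }
first=${first%% End text*}

# Longest prefix (##) plus shortest suffix (%): keeps the last section
last=${string##*Start text }
last=${last% End text*}

echo "$first"   # prints: [a]
echo "$last"    # prints: [b]
```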

7. Conclusion

Extracting text between two strings in Bash can be achieved through various methods, each with its own advantages. grep with -oP is powerful for regex-based matching, sed excels in stream editing, awk is great for field-based text processing, and Bash parameter expansion offers a built-in solution.
