Taro Logo

Tag Validator

Hard
Asked by:
Profile picture
4 views
Topics:
StringsStacks

Given a string representing a code snippet, implement a tag validator to parse the code and return whether it is valid.

A code snippet is valid if all the following rules hold:

  1. The code must be wrapped in a valid closed tag. Otherwise, the code is invalid.
  2. A closed tag (not necessarily valid) has exactly the following format : <TAG_NAME>TAG_CONTENT</TAG_NAME>. Among them, <TAG_NAME> is the start tag, and </TAG_NAME> is the end tag. The TAG_NAME in start and end tags should be the same. A closed tag is valid if and only if the TAG_NAME and TAG_CONTENT are valid.
  3. A valid TAG_NAME only contain upper-case letters, and has length in range [1,9]. Otherwise, the TAG_NAME is invalid.
  4. A valid TAG_CONTENT may contain other valid closed tags, cdata and any characters (see note1) EXCEPT unmatched <, unmatched start and end tag, and unmatched or closed tags with invalid TAG_NAME. Otherwise, the TAG_CONTENT is invalid.
  5. A start tag is unmatched if no end tag exists with the same TAG_NAME, and vice versa. However, you also need to consider the issue of unbalanced when tags are nested.
  6. A < is unmatched if you cannot find a subsequent >. And when you find a < or </, all the subsequent characters until the next > should be parsed as TAG_NAME (not necessarily valid).
  7. The cdata has the following format : <![CDATA[CDATA_CONTENT]]>. The range of CDATA_CONTENT is defined as the characters between <![CDATA[ and the first subsequent ]]>.
  8. CDATA_CONTENT may contain any characters. The function of cdata is to forbid the validator to parse CDATA_CONTENT, so even it has some characters that can be parsed as tag (no matter valid or invalid), you should treat it as regular characters.

Example 1:

Input: code = "<DIV>This is the first line <![CDATA[<div>]]></DIV>"
Output: true
Explanation: 
The code is wrapped in a closed tag : <DIV> and </DIV>. 
The TAG_NAME is valid, the TAG_CONTENT consists of some characters and cdata. 
Although CDATA_CONTENT has an unmatched start tag with invalid TAG_NAME, it should be considered as plain text, not parsed as a tag.
So TAG_CONTENT is valid, and then the code is valid. Thus return true.

Example 2:

Input: code = "<DIV>>>  ![cdata[]] <![CDATA[<div>]>]]>]]>>]</DIV>"
Output: true
Explanation:
We first separate the code into : start_tag|tag_content|end_tag.
start_tag -> "<DIV>"
end_tag -> "</DIV>"
tag_content could also be separated into : text1|cdata|text2.
text1 -> ">>  ![cdata[]] "
cdata -> "<![CDATA[<div>]>]]>", where the CDATA_CONTENT is "<div>]>"
text2 -> "]]>>]"
The reason why start_tag is NOT "<DIV>>>" is because of the rule 6.
The reason why cdata is NOT "<![CDATA[<div>]>]]>]]>" is because of the rule 7.

Example 3:

Input: code = "<A>  <B> </A>   </B>"
Output: false
Explanation: Unbalanced. If "<A>" is closed, then "<B>" must be unmatched, and vice versa.

Constraints:

  • 1 <= code.length <= 500
  • code consists of English letters, digits, '<', '>', '/', '!', '[', ']', '.', and ' '.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the constraints on the tag names? Specifically, what characters are allowed, and what is the maximum length of a tag name?
  2. Can the XML code contain comments or processing instructions, and if so, should I ignore them or treat them as invalid?
  3. How should I handle nested CDATA sections, or CDATA sections that contain invalid XML characters?
  4. If the code is invalid, are there specific error codes or messages I should return, or is a simple `false` sufficient?
  5. Is the XML code guaranteed to be well-formed in terms of basic syntax (e.g., no unclosed angle brackets outside of tags and CDATA), or do I need to handle malformed inputs as well?

Brute Force Solution

Approach

We want to check if a piece of text that uses special tags is valid. The brute force way is to meticulously go through the text and try to match every opening tag with a closing tag.

Here's how the algorithm would work step-by-step:

  1. Start reading the text from the very beginning, character by character.
  2. If you see an opening tag, like '<TAG_NAME>', remember what that tag is.
  3. Keep reading, looking for the matching closing tag, like '</TAG_NAME>'.
  4. If you find a matching closing tag, mark that section of the text as valid and continue checking the rest of the text.
  5. If you don't find a matching closing tag, the text is invalid.
  6. If you find a closing tag before finding its matching opening tag, the text is invalid.
  7. Check that all the tags are properly nested within each other; meaning inner tags must be closed before their outer tags.
  8. If you reach the end of the text and everything matches up correctly, the text is valid.

Code Implementation

def is_valid_tag(code):
    code_length = len(code)
    index = 0
    tags = []

    if code_length == 0:
        return False

    while index < code_length:
        if code[index] == '<':
            if index + 1 < code_length and code[index + 1] == '!':
                #Handle CDATA section
                if code[index + 2:index + 9] == '[CDATA[':
                    close_cdata_index = code.find(']]>', index + 9)
                    if close_cdata_index == -1:
                        return False
                    index = close_cdata_index + 3
                else:
                    return False
            elif index + 1 < code_length and code[index + 1] == '/':
                #Closing tag
                end_index = code.find('>', index + 2)
                if end_index == -1:
                    return False
                tag_name = code[index + 2:end_index]

                if not tags:
                    return False

                if tags[-1] == tag_name:
                    tags.pop()
                else:
                    return False
                index = end_index + 1
            else:
                #Opening tag
                end_index = code.find('>', index + 1)
                if end_index == -1:
                    return False
                tag_name = code[index + 1:end_index]

                # Tag name validation
                if not (1 <= len(tag_name) <= 9 and tag_name.isupper()):
                    return False

                tags.append(tag_name)
                index = end_index + 1
        else:
            #This represents text content. 
            index += 1

    #Ensure no tags remain open
    if tags:
        return False
    
    #Need to have at least one tag
    first_tag_end = code.find('>')
    if first_tag_end == -1:
        return False

    # Must have at least one valid tag
    last_tag_start = code.rfind('<')
    if last_tag_start == -1:
        return False

    return True

Big(O) Analysis

Time Complexity
O(n^2)The algorithm iterates through the input string of length n, potentially multiple times. In the worst-case scenario, for each character representing the start of a tag, the algorithm might have to scan a significant portion of the remaining string to find its corresponding closing tag. This nested scanning, where for each of the n characters we might perform a search up to n characters, leads to a time complexity proportional to n * n. Therefore, the overall time complexity of this approach is O(n^2).
Space Complexity
O(N)The algorithm uses a stack to keep track of the opening tags. In the worst-case scenario, where we have nested tags without corresponding closing tags until the very end, the stack can grow up to the number of opening tags present in the input text. Therefore, the auxiliary space used by the stack is proportional to the number of tags, which in the worst case, could be proportional to the size of the input N, where N is the length of the input text.

Optimal Solution

Approach

The challenge is to check if a piece of text, marked up with tags, is valid HTML-like code. The efficient way is to go through the text piece by piece, keeping track of what tags are currently open, and making sure everything closes in the right order and follows the rules.

Here's how the algorithm would work step-by-step:

  1. Read the code character by character.
  2. If you find an opening tag, check if its name is valid (uppercase letters only and between 1-9 characters).
  3. When you see an opening tag, remember that it's now open.
  4. If you find a closing tag, check if its name matches the last opened tag.
  5. If it matches, mark that the tag is now closed.
  6. If it doesn't match or if a tag closes without being opened first, the code is invalid.
  7. Also, make sure that tags don't overlap incorrectly.
  8. If you find content (text between tags), make sure you are already inside an open tag.
  9. After reading the whole code, check if all opened tags have been closed. If not, the code is invalid.
  10. The code must be inside one valid tag to be considered valid, if empty or contains only comments the code is invalid.

Code Implementation

def tag_validator(code):
    code = code.replace('<![CDATA[', ' ').replace(']]>', ' ')

    if not code:
        return True

    if not code.startswith('<') or not code.endswith('>'):
        return False

    stack_of_tags = []
    index = 0

    while index < len(code):
        if code[index:index + 2] == '</':
            end_index = code.find('>', index + 2)
            if end_index == -1:
                return False
            tag_name = code[index + 2:end_index]

            if not stack_of_tags or stack_of_tags[-1] != tag_name:
                return False
            stack_of_tags.pop()
            index = end_index + 1

        elif code[index] == '<':
            if code[index + 1] == '!':
                return False
            end_index = code.find('>', index + 1)
            if end_index == -1:
                return False
            tag_name = code[index + 1:end_index]

            if not (1 <= len(tag_name) <= 9 and tag_name.isupper()):
                return False

            stack_of_tags.append(tag_name)
            index = end_index + 1

        else:
            # Text content; move to next tag.
            index += 1

    # Stack must be empty for valid tag structure.
    return not stack_of_tags

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through the input string of length n character by character. Within the loop, tag validation operations (checking tag name validity, matching opening and closing tags) perform constant time operations. The stack operations (pushing and popping tags) also take constant time. Therefore, the overall time complexity is dominated by the single pass through the string, resulting in O(n).
Space Complexity
O(N)The algorithm uses a stack to keep track of open tags. In the worst-case scenario, where there are nested opening tags without corresponding closing tags until the end of the input, the stack's size can grow linearly with the length of the input string. Therefore, the auxiliary space used by the stack can be proportional to N, where N is the length of the input string representing the tagged text. Other variables used take constant space, but the stack dominates the space complexity.

Edge Cases

Empty input string
How to Handle:
Return false immediately as an empty string cannot represent a valid XML document with a root tag.
String with only whitespace
How to Handle:
Return false as whitespace alone doesn't form a valid XML structure.
Unclosed root tag (e.g., <ABC>)
How to Handle:
The stack should not be empty at the end, indicating an unclosed root tag, and should return false.
Unmatched closing tag (e.g., </ABC>)
How to Handle:
The stack should not encounter a closing tag that doesn't match the top of the stack, indicating a tag mismatch, and should return false.
Invalid tag name (e.g., <123>)
How to Handle:
Tag name validation should check for invalid characters or leading digits and return false if found.
Nested CDATA sections
How to Handle:
The CDATA parsing logic should ensure nested CDATA sections are not allowed and return false if detected.
Code containing only CDATA
How to Handle:
The code should be invalidated because a root tag is required, even if CDATA is present.
Tag with characters after the closing tag
How to Handle:
If any characters exist after the closing root tag, the solution should return false.