pdoc.markdown2

A fast and complete Python implementation of Markdown.

[from http://daringfireball.net/projects/markdown/]

Markdown is a text-to-HTML filter; it translates an easy-to-read / easy-to-write structured text format into HTML. Markdown's text format is most similar to that of plain text email, and supports features such as headers, emphasis, code blocks, blockquotes, and links.

Markdown's syntax is designed not as a generic markup language, but specifically to serve as a front-end to (X)HTML. You can use span-level HTML tags anywhere in a Markdown document, and you can use block-level HTML tags (like <div> and <table>) as well.

Module usage:

>>> import markdown2
>>> markdown2.markdown("*boo!*")  # or use `html = markdown_path(PATH)`
'<p><em>boo!</em></p>\n'

>>> markdowner = Markdown()
>>> markdowner.convert("*boo!*")
'<p><em>boo!</em></p>\n'
>>> markdowner.convert("**boom!**")
'<p><strong>boom!</strong></p>\n'

This implementation of Markdown implements the full "core" syntax plus a number of extras (e.g., code syntax coloring, footnotes) as described on https://github.com/trentm/python-markdown2/wiki/Extras.
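As a rough illustration of the span-level step only, here is a deliberately tiny, self-contained sketch; the real markdown2 pipeline in the source below is far more involved (character escaping, HTML hashing, block-level structure, extras):

```python
import re

# A tiny sketch of span-level conversion -- NOT the full markdown2
# algorithm. Strong runs before emphasis so '**' wins over '*'.
_strong_re = re.compile(r'\*\*(?=\S)(.+?)(?<=\S)\*\*')
_em_re = re.compile(r'\*(?=\S)(.+?)(?<=\S)\*')

def tiny_markdown(text):
    text = _strong_re.sub(r'<strong>\1</strong>', text)
    text = _em_re.sub(r'<em>\1</em>', text)
    return '<p>%s</p>\n' % text
```

The lookahead/lookbehind assertions mirror Markdown's rule that emphasis markers must hug non-whitespace, which is why `* not emphasis *` is left alone.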

# fmt: off
# flake8: noqa
# type: ignore
# Taken from here: https://github.com/trentm/python-markdown2/blob/ac5e7b956e9b8bc952039bfecb158ef1ddd7d422

#!/usr/bin/env python
# Copyright (c) 2012 Trent Mick.
# Copyright (c) 2007-2008 ActiveState Corp.
# License: MIT (http://www.opensource.org/licenses/mit-license.php)

r"""A fast and complete Python implementation of Markdown.

[from http://daringfireball.net/projects/markdown/]
> Markdown is a text-to-HTML filter; it translates an easy-to-read /
> easy-to-write structured text format into HTML.  Markdown's text
> format is most similar to that of plain text email, and supports
> features such as headers, *emphasis*, code blocks, blockquotes, and
> links.
>
> Markdown's syntax is designed not as a generic markup language, but
> specifically to serve as a front-end to (X)HTML. You can use span-level
> HTML tags anywhere in a Markdown document, and you can use block level
> HTML tags (like <div> and <table> as well).

Module usage:

    >>> import markdown2
    >>> markdown2.markdown("*boo!*")  # or use `html = markdown_path(PATH)`
    '<p><em>boo!</em></p>\n'

    >>> markdowner = Markdown()
    >>> markdowner.convert("*boo!*")
    '<p><em>boo!</em></p>\n'
    >>> markdowner.convert("**boom!**")
    '<p><strong>boom!</strong></p>\n'

This implementation of Markdown implements the full "core" syntax plus a
number of extras (e.g., code syntax coloring, footnotes) as described on
<https://github.com/trentm/python-markdown2/wiki/Extras>.
"""

cmdln_desc = """A fast and complete Python implementation of Markdown, a
text-to-HTML conversion tool for web writers.

Supported extra syntax options (see -x|--extras option below and
see <https://github.com/trentm/python-markdown2/wiki/Extras> for details):

* admonitions: Enable parsing of RST admonitions.
* break-on-newline: Replace single new line characters with <br> when True
* code-friendly: Disable _ and __ for em and strong.
* cuddled-lists: Allow lists to be cuddled to the preceding paragraph.
* fenced-code-blocks: Allows a code block to not have to be indented
  by fencing it with '```' on a line before and after. Based on
  <http://github.github.com/github-flavored-markdown/> with support for
  syntax highlighting.
* footnotes: Support footnotes as in use on daringfireball.net and
  implemented in other Markdown processors (though not in Markdown.pl v1.0.1).
* header-ids: Adds "id" attributes to headers. The id value is a slug of
  the header text.
* highlightjs-lang: Allows specifying the language which is used for syntax
  highlighting when using fenced-code-blocks and highlightjs.
* html-classes: Takes a dict mapping html tag names (lowercase) to a
  string to use for a "class" tag attribute. Currently only supports "img",
  "table", "pre" and "code" tags. Add an issue if you require this for other
  tags.
* link-patterns: Auto-link given regex patterns in text (e.g. bug number
  references, revision number references).
* markdown-in-html: Allow the use of `markdown="1"` in a block HTML tag to
  have markdown processing be done on its contents. Similar to
  <http://michelf.com/projects/php-markdown/extra/#markdown-attr> but with
  some limitations.
* metadata: Extract metadata from a leading '---'-fenced block.
  See <https://github.com/trentm/python-markdown2/issues/77> for details.
* nofollow: Add `rel="nofollow"` to all `<a>` tags with an href. See
  <http://en.wikipedia.org/wiki/Nofollow>.
* numbering: Support for generic counters.  Non-standard extension to
  allow sequential numbering of figures, tables, equations, exhibits etc.
* pyshell: Treats unindented Python interactive shell sessions as <code>
  blocks.
* smarty-pants: Replaces ' and " with curly quotation marks or curly
  apostrophes.  Replaces --, ---, ..., and . . . with en dashes, em dashes,
  and ellipses.
* spoiler: A special kind of blockquote commonly hidden behind a
  click on SO. Syntax per <http://meta.stackexchange.com/a/72878>.
* strike: text inside of double tilde is ~~strikethrough~~
* tag-friendly: Requires atx style headers to have a space between the # and
  the header text. Useful for applications that require twitter style tags to
  pass through the parser.
* tables: Tables using the same format as GFM
  <https://help.github.com/articles/github-flavored-markdown#tables> and
  PHP-Markdown Extra <https://michelf.ca/projects/php-markdown/extra/#table>.
* toc: The returned HTML string gets a new "toc_html" attribute which is
  a Table of Contents for the document. (experimental)
* use-file-vars: Look for an Emacs-style markdown-extras file variable to turn
  on Extras.
* wiki-tables: Google Code Wiki-style tables. See
  <http://code.google.com/p/support/wiki/WikiSyntax#Tables>.
* xml: Passes one-liner processing instructions and namespaced XML tags.
"""

# Dev Notes:
# - Python's regex syntax doesn't have '\z', so I'm using '\Z'. I'm
#   not yet sure if there are implications with this. Compare 'pydoc sre'
#   and 'perldoc perlre'.

__version_info__ = (2, 4, 4)
__version__ = '.'.join(map(str, __version_info__))
__author__ = "Trent Mick"

import sys
import re
import logging
from hashlib import sha256
import optparse
from random import random, randint
import codecs
from collections import defaultdict

# ---- globals

DEBUG = False
log = logging.getLogger("markdown")

DEFAULT_TAB_WIDTH = 4

SECRET_SALT = bytes(randint(0, 1000000))


# MD5 function was previously used for this; the "md5" prefix was kept for
# backwards compatibility.
def _hash_text(s):
    return 'md5-' + sha256(SECRET_SALT + s.encode("utf-8")).hexdigest()[32:]


# Table of hash values for escaped characters:
g_escape_table = dict([(ch, _hash_text(ch))
                       for ch in '\\`*_{}[]()>#+-.!'])

# Ampersand-encoding based entirely on Nat Irons's Amputator MT plugin:
#   http://bumppo.net/projects/amputator/
_AMPERSAND_RE = re.compile(r'&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)')


# ---- exceptions
class MarkdownError(Exception):
    pass



# ---- public api

def markdown_path(path, encoding="utf-8",
                  html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
                  safe_mode=None, extras=None, link_patterns=None,
                  footnote_title=None, footnote_return_symbol=None,
                  use_file_vars=False):
    fp = codecs.open(path, 'r', encoding)
    text = fp.read()
    fp.close()
    return Markdown(html4tags=html4tags, tab_width=tab_width,
                    safe_mode=safe_mode, extras=extras,
                    link_patterns=link_patterns,
                    footnote_title=footnote_title,
                    footnote_return_symbol=footnote_return_symbol,
                    use_file_vars=use_file_vars).convert(text)


def markdown(text, html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
             safe_mode=None, extras=None, link_patterns=None,
             footnote_title=None, footnote_return_symbol=None,
             use_file_vars=False, cli=False):
    return Markdown(html4tags=html4tags, tab_width=tab_width,
                    safe_mode=safe_mode, extras=extras,
                    link_patterns=link_patterns,
                    footnote_title=footnote_title,
                    footnote_return_symbol=footnote_return_symbol,
                    use_file_vars=use_file_vars, cli=cli).convert(text)


class Markdown(object):
    # The dict of "extras" to enable in processing -- a mapping of
    # extra name to argument for the extra. Most extras do not have an
    # argument, in which case the value is None.
    #
    # This can be set via (a) subclassing and (b) the constructor
    # "extras" argument.
    extras = None

    urls = None
    titles = None
    html_blocks = None
    html_spans = None
    html_removed_text = "{(#HTML#)}"  # placeholder removed text that does not trigger bold
    html_removed_text_compat = "[HTML_REMOVED]"  # for compat with markdown.py

    _toc = None

    # Used to track when we're inside an ordered or unordered list
    # (see _ProcessListItems() for details):
    list_level = 0

    _ws_only_line_re = re.compile(r"^[ \t]+$", re.M)

    def __init__(self, html4tags=False, tab_width=4, safe_mode=None,
                 extras=None, link_patterns=None,
                 footnote_title=None, footnote_return_symbol=None,
                 use_file_vars=False, cli=False):
        if html4tags:
            self.empty_element_suffix = ">"
        else:
            self.empty_element_suffix = " />"
        self.tab_width = tab_width
        self.tab = tab_width * " "

        # For compatibility with earlier markdown2.py and with
        # markdown.py's safe_mode being a boolean,
        #   safe_mode == True -> "replace"
        if safe_mode is True:
            self.safe_mode = "replace"
        else:
            self.safe_mode = safe_mode

        # Massaging and building the "extras" info.
        if self.extras is None:
            self.extras = {}
        elif not isinstance(self.extras, dict):
            self.extras = dict([(e, None) for e in self.extras])
        if extras:
            if not isinstance(extras, dict):
                extras = dict([(e, None) for e in extras])
            self.extras.update(extras)
        assert isinstance(self.extras, dict)

        if "toc" in self.extras:
            if "header-ids" not in self.extras:
                self.extras["header-ids"] = None  # "toc" implies "header-ids"

            if self.extras["toc"] is None:
                self._toc_depth = 6
            else:
                self._toc_depth = self.extras["toc"].get("depth", 6)
        self._instance_extras = self.extras.copy()

        if 'link-patterns' in self.extras:
            if link_patterns is None:
                # if you have specified that the link-patterns extra SHOULD
                # be used (via self.extras) but you haven't provided anything
                # via the link_patterns argument then an error is raised
                raise MarkdownError("If the 'link-patterns' extra is used, an argument for 'link_patterns' is required")
        self.link_patterns = link_patterns
        self.footnote_title = footnote_title
        self.footnote_return_symbol = footnote_return_symbol
        self.use_file_vars = use_file_vars
        self._outdent_re = re.compile(r'^(\t|[ ]{1,%d})' % tab_width, re.M)
        self.cli = cli

        self._escape_table = g_escape_table.copy()
        self._code_table = {}
        if "smarty-pants" in self.extras:
            self._escape_table['"'] = _hash_text('"')
            self._escape_table["'"] = _hash_text("'")

    def reset(self):
        self.urls = {}
        self.titles = {}
        self.html_blocks = {}
        self.html_spans = {}
        self.list_level = 0
        self.extras = self._instance_extras.copy()
        self._setup_extras()
        self._toc = None

    def _setup_extras(self):
        if "footnotes" in self.extras:
            self.footnotes = {}
            self.footnote_ids = []
        if "header-ids" in self.extras:
            self._count_from_header_id = defaultdict(int)
        if "metadata" in self.extras:
            self.metadata = {}

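The "massaging" step in `__init__` accepts extras either as a list of names or as a dict mapping each name to an argument; the list form is normalized to a dict with `None` arguments. The same normalization in isolation:

```python
# extras may come in as a list of names or a dict of name -> argument;
# __init__ normalizes the list form to a dict with None arguments.
extras = ["footnotes", "toc"]
if not isinstance(extras, dict):
    extras = dict([(e, None) for e in extras])
```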
    # Per <https://developer.mozilla.org/en-US/docs/HTML/Element/a> "rel"
    # should only be used in <a> tags with an "href" attribute.

    # target="_blank" opens the linked document in a new window or tab and
    # should only be used in <a> tags with an "href" attribute;
    # the same applies to _a_nofollow.
    _a_nofollow_or_blank_links = re.compile(r"""
        <(a)
        (
            [^>]*
            href=   # href is required
            ['"]?   # HTML5 attribute values do not have to be quoted
            [^#'"]  # We don't want to match href values that start with # (like footnotes)
        )
        """,
                                            re.IGNORECASE | re.VERBOSE
                                            )
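A standalone copy of the pattern, together with the substitution that `convert()` applies when the "nofollow" extra is on, shows why fragment links are skipped:

```python
import re

# Standalone copy of _a_nofollow_or_blank_links plus the "nofollow"
# substitution from convert(). Hrefs starting with '#' (e.g. footnote
# backlinks) deliberately do not match.
_a_links = re.compile(r"""
    <(a)
    (
        [^>]*
        href=   # href is required
        ['"]?   # HTML5 attribute values do not have to be quoted
        [^#'"]  # skip hrefs that start with '#'
    )
    """, re.IGNORECASE | re.VERBOSE)

html = '<a href="http://example.com">x</a> and <a href="#fn-1">note</a>'
out = _a_links.sub(r'<\1 rel="nofollow"\2', html)
```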

    def convert(self, text):
        """Convert the given text."""
        # Main function. The order in which other subs are called here is
        # essential. Link and image substitutions need to happen before
        # _EscapeSpecialChars(), so that any *'s or _'s in the <a>
        # and <img> tags get encoded.

        # Clear the global hashes. If we don't clear these, you get conflicts
        # from other articles when generating a page which contains more than
        # one article (e.g. an index page that shows the N most recent
        # articles):
        self.reset()

        if not isinstance(text, str):
            # TODO: perhaps shouldn't presume UTF-8 for string input?
            text = str(text, 'utf-8')

        if self.use_file_vars:
            # Look for emacs-style file variable hints.
            text = self._emacs_oneliner_vars_pat.sub(self._emacs_vars_oneliner_sub, text)
            emacs_vars = self._get_emacs_vars(text)
            if "markdown-extras" in emacs_vars:
                splitter = re.compile("[ ,]+")
                for e in splitter.split(emacs_vars["markdown-extras"]):
                    if '=' in e:
                        ename, earg = e.split('=', 1)
                        try:
                            earg = int(earg)
                        except ValueError:
                            pass
                    else:
                        ename, earg = e, None
                    self.extras[ename] = earg

            self._setup_extras()

        # Standardize line endings:
        text = text.replace("\r\n", "\n")
        text = text.replace("\r", "\n")

        # Make sure $text ends with a couple of newlines:
        text += "\n\n"

        # Convert all tabs to spaces.
        text = self._detab(text)

        # Strip any lines consisting only of spaces and tabs.
        # This makes subsequent regexen easier to write, because we can
        # match consecutive blank lines with /\n+/ instead of something
        # contorted like /[ \t]*\n+/ .
        text = self._ws_only_line_re.sub("", text)

        # strip metadata from head and extract
        if "metadata" in self.extras:
            text = self._extract_metadata(text)

        text = self.preprocess(text)

        if "fenced-code-blocks" in self.extras and not self.safe_mode:
            text = self._do_fenced_code_blocks(text)

        if self.safe_mode:
            text = self._hash_html_spans(text)

        # Turn block-level HTML blocks into hash entries
        text = self._hash_html_blocks(text, raw=True)

        if "fenced-code-blocks" in self.extras and self.safe_mode:
            text = self._do_fenced_code_blocks(text)

        if 'admonitions' in self.extras:
            text = self._do_admonitions(text)

        # Because numbering references aren't links (yet?), we can do everything
        # associated with counters before we get started.
        if "numbering" in self.extras:
            text = self._do_numbering(text)

        # Strip link definitions, store in hashes.
        if "footnotes" in self.extras:
            # Must do footnotes first because an unlucky footnote defn
            # looks like a link defn:
            #   [^4]: this "looks like a link defn"
            text = self._strip_footnote_definitions(text)
        text = self._strip_link_definitions(text)

        text = self._run_block_gamut(text)

        if "footnotes" in self.extras:
            text = self._add_footnotes(text)

        text = self.postprocess(text)

        text = self._unescape_special_chars(text)

        if self.safe_mode:
            text = self._unhash_html_spans(text)
            # return the removed text warning to its markdown.py compatible form
            text = text.replace(self.html_removed_text, self.html_removed_text_compat)

        do_target_blank_links = "target-blank-links" in self.extras
        do_nofollow_links = "nofollow" in self.extras

        if do_target_blank_links and do_nofollow_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow noopener" target="_blank"\2', text)
        elif do_target_blank_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="noopener" target="_blank"\2', text)
        elif do_nofollow_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow"\2', text)

        if "toc" in self.extras and self._toc:
            self._toc_html = calculate_toc_html(self._toc)

            # Prepend toc html to output
            if self.cli:
                text = '{}\n{}'.format(self._toc_html, text)

        text += "\n"

        # Attach attrs to output
        rv = UnicodeWithAttrs(text)

        if "toc" in self.extras and self._toc:
            rv.toc_html = self._toc_html

        if "metadata" in self.extras:
            rv.metadata = self.metadata
        return rv

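`convert()` returns a `UnicodeWithAttrs` instance (a `str` subclass defined later in the full module), which is what lets `toc_html` and `metadata` ride along on the returned HTML. A minimal sketch of the idea:

```python
# Minimal sketch of the UnicodeWithAttrs idea: a str subclass can carry
# extra attributes (metadata, toc_html) on the converted HTML string.
class UnicodeWithAttrs(str):
    metadata = None
    toc_html = None

html = UnicodeWithAttrs('<p>hi</p>\n')
html.metadata = {'title': 'Hi'}
```

Because it is still a `str`, callers that ignore the attributes can treat the result as ordinary text.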
    def postprocess(self, text):
        """A hook for subclasses to do some postprocessing of the html, if
        desired. This is called before unescaping of special chars and
        unhashing of raw HTML spans.
        """
        return text

    def preprocess(self, text):
        """A hook for subclasses to do some preprocessing of the Markdown, if
        desired. This is called after basic formatting of the text, but prior
        to any extras, safe mode, etc. processing.
        """
        return text

    # It is metadata if the content starts with optional '---'-fenced `key: value`
    # pairs. E.g. (indented for presentation):
    #   ---
    #   foo: bar
    #   another-var: blah blah
    #   ---
    #   # header
    # or:
    #   foo: bar
    #   another-var: blah blah
    #
    #   # header
    _meta_data_pattern = re.compile(r'''
        ^(?:---[\ \t]*\n)?(  # optional opening fence
            (?:
                [\S \t]*\w[\S \t]*\s*:(?:\n+[ \t]+.*)+  # indented lists
            )|(?:
                (?:[\S \t]*\w[\S \t]*\s*:\s+>(?:\n\s+.*)+?)  # multiline long descriptions
                (?=\n[\S \t]*\w[\S \t]*\s*:\s*.*\n|\s*\Z)  # match up until the start of the next key:value definition or the end of the input text
            )|(?:
                [\S \t]*\w[\S \t]*\s*:(?! >).*\n?  # simple key:value pair, leading spaces allowed
            )
        )(?:---[\ \t]*\n)?  # optional closing fence
        ''', re.MULTILINE | re.VERBOSE
                                    )

    _key_val_list_pat = re.compile(
        r"^-(?:[ \t]*([^\n]*)(?:[ \t]*[:-][ \t]*(\S+))?)(?:\n((?:[ \t]+[^\n]+\n?)+))?",
        re.MULTILINE,
    )
    _key_val_dict_pat = re.compile(
        r"^([^:\n]+)[ \t]*:[ \t]*([^\n]*)(?:((?:\n[ \t]+[^\n]+)+))?", re.MULTILINE
    )  # grp0: key, grp1: value, grp2: multiline value
    _meta_data_fence_pattern = re.compile(r'^---[\ \t]*\n', re.MULTILINE)
    _meta_data_newline = re.compile("^\n", re.MULTILINE)

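For a document that starts with a '---' fence, the extraction step first splits on `_meta_data_fence_pattern`, leaving the raw metadata block and the document body in separate pieces. A standalone illustration:

```python
import re

# Standalone copy of _meta_data_fence_pattern: splitting on it separates
# a leading '---'-fenced metadata block from the document body.
fence = re.compile(r'^---[\ \t]*\n', re.MULTILINE)

text = "---\ntitle: Hi\nauthor: me\n---\n# Heading\n"
parts = re.split(fence, text, maxsplit=2)
# parts[1] is the raw metadata block, parts[2] the remaining document
```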
    def _extract_metadata(self, text):
        if text.startswith("---"):
            fence_splits = re.split(self._meta_data_fence_pattern, text, maxsplit=2)
            metadata_content = fence_splits[1]
            match = re.findall(self._meta_data_pattern, metadata_content)
            if not match:
                return text
            tail = fence_splits[2]
        else:
            metadata_split = re.split(self._meta_data_newline, text, maxsplit=1)
            metadata_content = metadata_split[0]
            match = re.findall(self._meta_data_pattern, metadata_content)
            if not match:
                return text
            tail = metadata_split[1]

        def parse_structured_value(value):
            vs = value.lstrip()
            vs = value.replace(v[: len(value) - len(vs)], "\n")[1:]

            # List
            if vs.startswith("-"):
                r = []
                for match in re.findall(self._key_val_list_pat, vs):
                    if match[0] and not match[1] and not match[2]:
                        r.append(match[0].strip())
                    elif match[0] == ">" and not match[1] and match[2]:
                        r.append(match[2].strip())
                    elif match[0] and match[1]:
                        r.append({match[0].strip(): match[1].strip()})
                    elif not match[0] and not match[1] and match[2]:
                        r.append(parse_structured_value(match[2]))
                    else:
                        # Broken case
                        pass

                return r

            # Dict
            else:
                return {
                    match[0].strip(): (
                        match[1].strip()
                        if match[1]
                        else parse_structured_value(match[2])
                    )
                    for match in re.findall(self._key_val_dict_pat, vs)
                }

        for item in match:

            k, v = item.split(":", 1)

            # Multiline value
            if v[:3] == " >\n":
                self.metadata[k.strip()] = _dedent(v[3:]).strip()

            # Empty value
            elif v == "\n":
                self.metadata[k.strip()] = ""

            # Structured value
            elif v[0] == "\n":
                self.metadata[k.strip()] = parse_structured_value(v)

            # Simple value
            else:
                self.metadata[k.strip()] = v.strip()

        return tail

    _emacs_oneliner_vars_pat = re.compile(r"((?:<!--)?\s*-\*-)\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?(-\*-\s*(?:-->)?)",
                                          re.UNICODE)
    # This regular expression is intended to match blocks like this:
    #    PREFIX Local Variables: SUFFIX
    #    PREFIX mode: Tcl SUFFIX
    #    PREFIX End: SUFFIX
    # Some notes:
    # - "[ \t]" is used instead of "\s" to specifically exclude newlines
    # - "(\r\n|\n|\r)" is used instead of "$" because the sre engine does
    #   not like anything other than Unix-style line terminators.
    _emacs_local_vars_pat = re.compile(r"""^
        (?P<prefix>(?:[^\r\n|\n|\r])*?)
        [\ \t]*Local\ Variables:[\ \t]*
        (?P<suffix>.*?)(?:\r\n|\n|\r)
        (?P<content>.*?\1End:)
        """, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

    def _emacs_vars_oneliner_sub(self, match):
        if match.group(1).strip() == '-*-' and match.group(4).strip() == '-*-':
            lead_ws = re.findall(r'^\s*', match.group(1))[0]
            tail_ws = re.findall(r'\s*$', match.group(4))[0]
            return '%s<!-- %s %s %s -->%s' % (lead_ws, '-*-', match.group(2).strip(), '-*-', tail_ws)

        start, end = match.span()
        return match.string[start: end]

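The one-liner pattern above captures the variable string between the `-*-` markers in group 2. A standalone copy run against a typical HTML-comment form:

```python
import re

# Standalone copy of _emacs_oneliner_vars_pat; group(2) holds the
# variable string between the '-*-' markers (with surrounding whitespace,
# hence the .strip() calls in the module's own code).
pat = re.compile(
    r"((?:<!--)?\s*-\*-)\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?(-\*-\s*(?:-->)?)",
    re.UNICODE)

m = pat.search("<!-- -*- markdown-extras: wiki-tables -*- -->")
```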
    def _get_emacs_vars(self, text):
        """Return a dictionary of emacs-style local variables.

        Parsing is done loosely according to this spec (and according to
        some in-practice deviations from this):
        http://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html#Specifying-File-Variables
        """
        emacs_vars = {}
        SIZE = pow(2, 13)  # 8kB

        # Search near the start for a '-*-'-style one-liner of variables.
        head = text[:SIZE]
        if "-*-" in head:
            match = self._emacs_oneliner_vars_pat.search(head)
            if match:
                emacs_vars_str = match.group(2)
                assert '\n' not in emacs_vars_str
                emacs_var_strs = [s.strip() for s in emacs_vars_str.split(';')
                                  if s.strip()]
                if len(emacs_var_strs) == 1 and ':' not in emacs_var_strs[0]:
                    # While not in the spec, this form is allowed by emacs:
                    #   -*- Tcl -*-
                    # where the implied "variable" is "mode". This form
                    # is only allowed if there are no other variables.
                    emacs_vars["mode"] = emacs_var_strs[0].strip()
                else:
                    for emacs_var_str in emacs_var_strs:
                        try:
                            variable, value = emacs_var_str.strip().split(':', 1)
                        except ValueError:
                            log.debug("emacs variables error: malformed -*- "
                                      "line: %r", emacs_var_str)
                            continue
                        # Lowercase the variable name because Emacs allows "Mode"
                        # or "mode" or "MoDe", etc.
                        emacs_vars[variable.lower()] = value.strip()

        tail = text[-SIZE:]
        if "Local Variables" in tail:
            match = self._emacs_local_vars_pat.search(tail)
            if match:
                prefix = match.group("prefix")
                suffix = match.group("suffix")
                lines = match.group("content").splitlines(0)
                # print "prefix=%r, suffix=%r, content=%r, lines: %s"\
                #      % (prefix, suffix, match.group("content"), lines)

                # Validate the Local Variables block: proper prefix and suffix
                # usage.
                for i, line in enumerate(lines):
                    if not line.startswith(prefix):
                        log.debug("emacs variables error: line '%s' "
                                  "does not use proper prefix '%s'"
                                  % (line, prefix))
                        return {}
                    # Don't validate suffix on last line. Emacs doesn't care,
                    # neither should we.
                    if i != len(lines) - 1 and not line.endswith(suffix):
                        log.debug("emacs variables error: line '%s' "
                                  "does not use proper suffix '%s'"
                                  % (line, suffix))
                        return {}

                # Parse out one emacs var per line.
                continued_for = None
                for line in lines[:-1]:  # no var on the last line ("PREFIX End:")
                    if prefix: line = line[len(prefix):]  # strip prefix
                    if suffix: line = line[:-len(suffix)]  # strip suffix
                    line = line.strip()
                    if continued_for:
                        variable = continued_for
                        if line.endswith('\\'):
                            line = line[:-1].rstrip()
                        else:
                            continued_for = None
                        emacs_vars[variable] += ' ' + line
                    else:
                        try:
                            variable, value = line.split(':', 1)
                        except ValueError:
                            log.debug("local variables error: missing colon "
                                      "in local variables entry: '%s'" % line)
                            continue
                        # Do NOT lowercase the variable name, because Emacs only
                        # allows "mode" (and not "Mode", "MoDe", etc.) in this block.
                        value = value.strip()
                        if value.endswith('\\'):
                            value = value[:-1].rstrip()
                            continued_for = variable
                        else:
                            continued_for = None
                        emacs_vars[variable] = value

        # Unquote values.
        for var, val in list(emacs_vars.items()):
            if len(val) > 1 and (val.startswith('"') and val.endswith('"')
                                 or val.startswith("'") and val.endswith("'")):
                emacs_vars[var] = val[1:-1]

        return emacs_vars

    def _detab_line(self, line):
        r"""Recursively convert tabs to spaces in a single line.

        Called from _detab()."""
        if '\t' not in line:
            return line
        chunk1, chunk2 = line.split('\t', 1)
        chunk1 += (' ' * (self.tab_width - len(chunk1) % self.tab_width))
        output = chunk1 + chunk2
        return self._detab_line(output)

    def _detab(self, text):
        r"""Iterate text line by line and convert tabs to spaces.

            >>> m = Markdown()
            >>> m._detab("\tfoo")
            '    foo'
            >>> m._detab("  \tfoo")
            '    foo'
            >>> m._detab("\t  foo")
            '      foo'
            >>> m._detab("  foo")
            '  foo'
            >>> m._detab("  foo\n\tbar\tblam")
            '  foo\n    bar blam'
        """
        if '\t' not in text:
            return text
        output = []
        for line in text.splitlines():
            output.append(self._detab_line(line))
        return '\n'.join(output)

 709    # I broke out the html5 tags here and added them to _block_tags_a and
 710    # _block_tags_b.  This way html5 tags are easy to keep track of.
 711    _html5tags = '|article|aside|header|hgroup|footer|nav|section|figure|figcaption'
 712
 713    _block_tags_a = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del'
 714    _block_tags_a += _html5tags
 715
 716    _strict_tag_block_re = re.compile(r"""
 717        (                       # save in \1
 718            ^                   # start of line  (with re.M)
 719            <(%s)               # start tag = \2
 720            \b                  # word break
 721            (.*\n)*?            # any number of lines, minimally matching
 722            </\2>               # the matching end tag
 723            [ \t]*              # trailing spaces/tabs
 724            (?=\n+|\Z)          # followed by a newline or end of document
 725        )
 726        """ % _block_tags_a,
 727                                      re.X | re.M)
 728
 729    _block_tags_b = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math'
 730    _block_tags_b += _html5tags
 731
 732    _liberal_tag_block_re = re.compile(r"""
 733        (                       # save in \1
 734            ^                   # start of line  (with re.M)
 735            <(%s)               # start tag = \2
 736            \b                  # word break
 737            (.*\n)*?            # any number of lines, minimally matching
 738            .*</\2>             # the matching end tag
 739            [ \t]*              # trailing spaces/tabs
 740            (?=\n+|\Z)          # followed by a newline or end of document
 741        )
 742        """ % _block_tags_b,
 743                                       re.X | re.M)
 744
 745    _html_markdown_attr_re = re.compile(
 746        r'''\s+markdown=("1"|'1')''')
 747
 748    def _hash_html_block_sub(self, match, raw=False):
 749        html = match.group(1)
 750        if raw and self.safe_mode:
 751            html = self._sanitize_html(html)
 752        elif 'markdown-in-html' in self.extras and 'markdown=' in html:
 753            first_line = html.split('\n', 1)[0]
 754            m = self._html_markdown_attr_re.search(first_line)
 755            if m:
 756                lines = html.split('\n')
 757                middle = '\n'.join(lines[1:-1])
 758                last_line = lines[-1]
 759                first_line = first_line[:m.start()] + first_line[m.end():]
 760                f_key = _hash_text(first_line)
 761                self.html_blocks[f_key] = first_line
 762                l_key = _hash_text(last_line)
 763                self.html_blocks[l_key] = last_line
 764                return ''.join(["\n\n", f_key,
 765                                "\n\n", middle, "\n\n",
 766                                l_key, "\n\n"])
 767        key = _hash_text(html)
 768        self.html_blocks[key] = html
 769        return "\n\n" + key + "\n\n"
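The hash-and-restore pattern used here (stash raw HTML under an opaque key, splice only the key into the text, and swap the markup back in after all Markdown transforms) can be sketched as follows; `hash_text` is an assumed stand-in for the module's `_hash_text` helper:

```python
import hashlib

def hash_text(s: str) -> str:
    # Assumed to mirror the module's placeholder scheme: 'md5-' + hex digest.
    return 'md5-' + hashlib.md5(s.encode('utf-8')).hexdigest()

blocks = {}
html = '<div>raw</div>'
key = hash_text(html)
blocks[key] = html            # stash the block under its key...
text = '\n\n' + key + '\n\n'  # ...leaving only an opaque token in the text
# The token contains no Markdown-significant characters, so later passes
# cannot mangle the stashed HTML. Unhashing restores it verbatim:
restored = text.replace(key, blocks[key])
print(restored)  # '\n\n<div>raw</div>\n\n'
```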
 770
 771    def _hash_html_blocks(self, text, raw=False):
 772        """Hashify HTML blocks
 773
 774        We only want to do this for block-level HTML tags, such as headers,
 775        lists, and tables. That's because we still want to wrap <p>s around
 776        "paragraphs" that are wrapped in non-block-level tags, such as anchors,
 777        phrase emphasis, and spans. The list of tags we're looking for is
 778        hard-coded.
 779
 780        @param raw {boolean} indicates if these are raw HTML blocks in
 781            the original source. It makes a difference in "safe" mode.
 782        """
 783        if '<' not in text:
 784            return text
 785
 786        # Pass `raw` value into our calls to self._hash_html_block_sub.
 787        hash_html_block_sub = _curry(self._hash_html_block_sub, raw=raw)
 788
 789        # First, look for nested blocks, e.g.:
 790        #   <div>
 791        #       <div>
 792        #       tags for inner block must be indented.
 793        #       </div>
 794        #   </div>
 795        #
 796        # The outermost tags must start at the left margin for this to match, and
 797        # the inner nested divs must be indented.
 798        # We need to do this before the next, more liberal match, because the next
 799        # match will start at the first `<div>` and stop at the first `</div>`.
 800        text = self._strict_tag_block_re.sub(hash_html_block_sub, text)
 801
 802        # Now match more liberally, simply from `\n<tag>` to `</tag>\n`
 803        text = self._liberal_tag_block_re.sub(hash_html_block_sub, text)
 804
 805        # Special case just for <hr />. It was easier to make a special
 806        # case than to make the other regex more complicated.
 807        if "<hr" in text:
 808            _hr_tag_re = _hr_tag_re_from_tab_width(self.tab_width)
 809            text = _hr_tag_re.sub(hash_html_block_sub, text)
 810
 811        # Special case for standalone HTML comments:
 812        if "<!--" in text:
 813            start = 0
 814            while True:
 815                # Delimiters for next comment block.
 816                try:
 817                    start_idx = text.index("<!--", start)
 818                except ValueError:
 819                    break
 820                try:
 821                    end_idx = text.index("-->", start_idx) + 3
 822                except ValueError:
 823                    break
 824
 825                # Start position for next comment block search.
 826                start = end_idx
 827
 828                # Validate whitespace before comment.
 829                if start_idx:
 830                    # - Up to `tab_width - 1` spaces before start_idx.
 831                    for i in range(self.tab_width - 1):
 832                        if text[start_idx - 1] != ' ':
 833                            break
 834                        start_idx -= 1
 835                        if start_idx == 0:
 836                            break
 837                    # - Must be preceded by 2 newlines or hit the start of
 838                    #   the document.
 839                    if start_idx == 0:
 840                        pass
 841                    elif start_idx == 1 and text[0] == '\n':
 842                        start_idx = 0  # to match minute detail of Markdown.pl regex
 843                    elif text[start_idx - 2:start_idx] == '\n\n':
 844                        pass
 845                    else:
 846                        break
 847
 848                # Validate whitespace after comment.
 849                # - Any number of spaces and tabs.
 850                while end_idx < len(text):
 851                    if text[end_idx] not in ' \t':
 852                        break
 853                    end_idx += 1
 854                # - Must be followed by 2 newlines or hit end of text.
 855                if text[end_idx:end_idx + 2] not in ('', '\n', '\n\n'):
 856                    continue
 857
 858                # Escape and hash (must match `_hash_html_block_sub`).
 859                html = text[start_idx:end_idx]
 860                if raw and self.safe_mode:
 861                    html = self._sanitize_html(html)
 862                key = _hash_text(html)
 863                self.html_blocks[key] = html
 864                text = text[:start_idx] + "\n\n" + key + "\n\n" + text[end_idx:]
 865
 866        if "xml" in self.extras:
 867            # Treat XML processing instructions and namespaced one-liner
 868            # tags as if they were block HTML tags. E.g., if standalone
 869            # (i.e. are their own paragraph), the following do not get
 870            # wrapped in a <p> tag:
 871            #    <?foo bar?>
 872            #
 873            #    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chapter_1.md"/>
 874            _xml_oneliner_re = _xml_oneliner_re_from_tab_width(self.tab_width)
 875            text = _xml_oneliner_re.sub(hash_html_block_sub, text)
 876
 877        return text
 878
 879    def _strip_link_definitions(self, text):
 880        # Strips link definitions from text, stores the URLs and titles in
 881        # hash references.
 882        less_than_tab = self.tab_width - 1
 883
 884        # Link defs are in the form:
 885        #   [id]: url "optional title"
 886        _link_def_re = re.compile(r"""
 887            ^[ ]{0,%d}\[(.+)\]: # id = \1
 888              [ \t]*
 889              \n?               # maybe *one* newline
 890              [ \t]*
 891            <?(.+?)>?           # url = \2
 892              [ \t]*
 893            (?:
 894                \n?             # maybe one newline
 895                [ \t]*
 896                (?<=\s)         # lookbehind for whitespace
 897                ['"(]
 898                ([^\n]*)        # title = \3
 899                ['")]
 900                [ \t]*
 901            )?  # title is optional
 902            (?:\n+|\Z)
 903            """ % less_than_tab, re.X | re.M | re.U)
 904        return _link_def_re.sub(self._extract_link_def_sub, text)
 905
 906    def _extract_link_def_sub(self, match):
 907        id, url, title = match.groups()
 908        key = id.lower()  # Link IDs are case-insensitive
 909        self.urls[key] = self._encode_amps_and_angles(url)
 910        if title:
 911            self.titles[key] = title
 912        return ""
 913
 914    def _do_numbering(self, text):
 915        ''' Handle the special extension for generic numbering of
 916            tables, figures, etc.
 917        '''
 918        # First pass to define all the references
 919        self.regex_defns = re.compile(r'''
 920            \[\#(\w+) # the counter.  Open square plus hash plus a word \1
 921            ([^@]*)   # Some optional characters, that aren't an @. \2
 922            @(\w+)       # the id.  Should this be normed? \3
 923            ([^\]]*)\]   # The rest of the text up to the terminating ] \4
 924            ''', re.VERBOSE)
 925        self.regex_subs = re.compile(r"\[@(\w+)\s*\]")  # [@ref_id]
 926        counters = {}
 927        references = {}
 928        replacements = []
 929        definition_html = '<figcaption class="{}" id="counter-ref-{}">{}{}{}</figcaption>'
 930        reference_html = '<a class="{}" href="#counter-ref-{}">{}</a>'
 931        for match in self.regex_defns.finditer(text):
 932            # We must have four match groups otherwise this isn't a numbering reference
 933            if len(match.groups()) != 4:
 934                continue
 935            counter = match.group(1)
 936            text_before = match.group(2).strip()
 937            ref_id = match.group(3)
 938            text_after = match.group(4)
 939            number = counters.get(counter, 1)
 940            references[ref_id] = (number, counter)
 941            replacements.append((match.start(0),
 942                                 definition_html.format(counter,
 943                                                        ref_id,
 944                                                        text_before,
 945                                                        number,
 946                                                        text_after),
 947                                 match.end(0)))
 948            counters[counter] = number + 1
 949        for repl in reversed(replacements):
 950            text = text[:repl[0]] + repl[1] + text[repl[2]:]
 951
 952        # Second pass to replace the references with the right
 953        # value of the counter
 954        # Fwiw, it's vaguely annoying to have to turn the iterator into
 955        # a list and then reverse it but I can't think of a better thing to do.
 956        for match in reversed(list(self.regex_subs.finditer(text))):
 957            number, counter = references.get(match.group(1), (None, None))
 958            if number is not None:
 959                repl = reference_html.format(counter,
 960                                             match.group(1),
 961                                             number)
 962            else:
 963                repl = reference_html.format(match.group(1),
 964                                             'countererror',
 965                                             '?' + match.group(1) + '?')
 966            if "smarty-pants" in self.extras:
 967                repl = repl.replace('"', self._escape_table['"'])
 968
 969            text = text[:match.start()] + repl + text[match.end():]
 970        return text
 971
 972    def _extract_footnote_def_sub(self, match):
 973        id, text = match.groups()
 974        text = _dedent(text, skip_first_line=not text.startswith('\n')).strip()
 975        normed_id = re.sub(r'\W', '-', id)
 976        # Ensure footnote text ends with a couple newlines (for some
 977        # block gamut matches).
 978        self.footnotes[normed_id] = text + "\n\n"
 979        return ""
 980
 981    def _strip_footnote_definitions(self, text):
 982        """A footnote definition looks like this:
 983
 984            [^note-id]: Text of the note.
 985
 986                May include one or more indented paragraphs.
 987
 988        Where,
 989        - The 'note-id' can be pretty much anything, though typically it
 990          is the number of the footnote.
 991        - The first paragraph may start on the next line, like so:
 992
 993            [^note-id]:
 994                Text of the note.
 995        """
 996        less_than_tab = self.tab_width - 1
 997        footnote_def_re = re.compile(r'''
 998            ^[ ]{0,%d}\[\^(.+)\]:   # id = \1
 999            [ \t]*
1000            (                       # footnote text = \2
1001              # First line need not start with the spaces.
1002              (?:\s*.*\n+)
1003              (?:
1004                (?:[ ]{%d} | \t)  # Subsequent lines must be indented.
1005                .*\n+
1006              )*
1007            )
1008            # Lookahead for non-space at line-start, or end of doc.
1009            (?:(?=^[ ]{0,%d}\S)|\Z)
1010            ''' % (less_than_tab, self.tab_width, self.tab_width),
1011                                     re.X | re.M)
1012        return footnote_def_re.sub(self._extract_footnote_def_sub, text)
1013
1014    _hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)
1015
1016    def _run_block_gamut(self, text):
1017        # These are all the transformations that form block-level
1018        # tags like paragraphs, headers, and list items.
1019
1020        if 'admonitions' in self.extras:
1021            text = self._do_admonitions(text)
1022
1023        if "fenced-code-blocks" in self.extras:
1024            text = self._do_fenced_code_blocks(text)
1025
1026        text = self._do_headers(text)
1027
1028        # Do Horizontal Rules:
1029        # On the number of spaces in horizontal rules: The spec is fuzzy: "If
1030        # you wish, you may use spaces between the hyphens or asterisks."
1031        # Markdown.pl 1.0.1's hr regexes limit the number of spaces between the
1032        # hr chars to one or two. We'll reproduce that limit here.
1033        hr = "\n<hr" + self.empty_element_suffix + "\n"
1034        text = re.sub(self._hr_re, hr, text)
1035
1036        text = self._do_lists(text)
1037
1038        if "pyshell" in self.extras:
1039            text = self._prepare_pyshell_blocks(text)
1040        if "wiki-tables" in self.extras:
1041            text = self._do_wiki_tables(text)
1042        if "tables" in self.extras:
1043            text = self._do_tables(text)
1044
1045        text = self._do_code_blocks(text)
1046
1047        text = self._do_block_quotes(text)
1048
1049        # We already ran _HashHTMLBlocks() before, in Markdown(), but that
1050        # was to escape raw HTML in the original Markdown source. This time,
1051        # we're escaping the markup we've just created, so that we don't wrap
1052        # <p> tags around block-level tags.
1053        text = self._hash_html_blocks(text)
1054
1055        text = self._form_paragraphs(text)
1056
1057        return text
1058
1059    def _pyshell_block_sub(self, match):
1060        if "fenced-code-blocks" in self.extras:
1061            dedented = _dedent(match.group(0))
1062            return self._do_fenced_code_blocks("```pycon\n" + dedented + "```\n")
1063        lines = match.group(0).splitlines(0)
1064        _dedentlines(lines)
1065        indent = ' ' * self.tab_width
1066        s = ('\n'  # separate from possible cuddled paragraph
1067             + indent + ('\n' + indent).join(lines)
1068             + '\n')
1069        return s
1070
1071    def _prepare_pyshell_blocks(self, text):
1072        """Ensure that Python interactive shell sessions are put in
1073        code blocks -- even if not properly indented.
1074        """
1075        if ">>>" not in text:
1076            return text
1077
1078        less_than_tab = self.tab_width - 1
1079        _pyshell_block_re = re.compile(r"""
1080            ^([ ]{0,%d})>>>[ ].*\n  # first line
1081            ^(\1[^\S\n]*\S.*\n)*    # any number of subsequent lines with at least one character
1082            (?=^\1?\n|\Z)           # ends with a blank line or end of document
1083            """ % less_than_tab, re.M | re.X)
1084
1085        return _pyshell_block_re.sub(self._pyshell_block_sub, text)
1086
1087    def _table_sub(self, match):
1088        trim_space_re = '^[ \t\n]+|[ \t\n]+$'
1089        trim_bar_re = r'^\||\|$'
1090        split_bar_re = r'^\||(?<![\`\\])\|'
1091        escape_bar_re = r'\\\|'
1092
1093        head, underline, body = match.groups()
1094
1095        # Determine aligns for columns.
1096        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
1097                re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", underline)))]
1098        align_from_col_idx = {}
1099        for col_idx, col in enumerate(cols):
1100            if col[0] == ':' and col[-1] == ':':
1101                align_from_col_idx[col_idx] = ' style="text-align:center;"'
1102            elif col[0] == ':':
1103                align_from_col_idx[col_idx] = ' style="text-align:left;"'
1104            elif col[-1] == ':':
1105                align_from_col_idx[col_idx] = ' style="text-align:right;"'
1106
1107        # thead
1108        hlines = ['<table%s>' % self._html_class_str_from_tag('table'), '<thead>', '<tr>']
1109        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
1110                re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", head)))]
1111        for col_idx, col in enumerate(cols):
1112            hlines.append('  <th%s>%s</th>' % (
1113                align_from_col_idx.get(col_idx, ''),
1114                self._run_span_gamut(col)
1115            ))
1116        hlines.append('</tr>')
1117        hlines.append('</thead>')
1118
1119        # tbody
1120        hlines.append('<tbody>')
1121        for line in body.strip('\n').split('\n'):
1122            hlines.append('<tr>')
1123            cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
1124                    re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", line)))]
1125            for col_idx, col in enumerate(cols):
1126                hlines.append('  <td%s>%s</td>' % (
1127                    align_from_col_idx.get(col_idx, ''),
1128                    self._run_span_gamut(col)
1129                ))
1130            hlines.append('</tr>')
1131        hlines.append('</tbody>')
1132        hlines.append('</table>')
1133
1134        return '\n'.join(hlines) + '\n'
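The alignment logic above reads the colons in the underline row: `:---` maps to left, `---:` to right, and `:---:` to center. A reduced sketch of that mapping, with a simplified split that ignores escaped pipes (the helper name is hypothetical):

```python
def aligns_from_underline(underline: str) -> dict:
    # Split the delimiter row on pipes and inspect leading/trailing colons.
    cols = [c.strip() for c in underline.strip().strip('|').split('|')]
    aligns = {}
    for i, col in enumerate(cols):
        if col.startswith(':') and col.endswith(':'):
            aligns[i] = 'center'
        elif col.startswith(':'):
            aligns[i] = 'left'
        elif col.endswith(':'):
            aligns[i] = 'right'
        # No colons: no alignment entry; the renderer emits a plain <th>/<td>.
    return aligns

print(aligns_from_underline('| :--- | :---: | ---: | --- |'))
# {0: 'left', 1: 'center', 2: 'right'}
```

The center check must come first, since `:---:` also satisfies both one-sided conditions.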
1135
1136    def _do_tables(self, text):
1137        """Copying PHP-Markdown and GFM table syntax. Some regex borrowed from
1138        https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L2538
1139        """
1140        less_than_tab = self.tab_width - 1
1141        table_re = re.compile(r'''
1142                (?:(?<=\n\n)|\A\n?)             # leading blank line
1143
1144                ^[ ]{0,%d}                      # allowed whitespace
1145                (.*[|].*)  \n                   # $1: header row (at least one pipe)
1146
1147                ^[ ]{0,%d}                      # allowed whitespace
1148                (                               # $2: underline row
1149                    # underline row with leading bar
1150                    (?:  \|\ *:?-+:?\ *  )+  \|? \s? \n
1151                    |
1152                    # or, underline row without leading bar
1153                    (?:  \ *:?-+:?\ *\|  )+  (?:  \ *:?-+:?\ *  )? \s? \n
1154                )
1155
1156                (                               # $3: data rows
1157                    (?:
1158                        ^[ ]{0,%d}(?!\ )         # ensure line begins with 0 to less_than_tab spaces
1159                        .*\|.*  \n
1160                    )+
1161                )
1162            ''' % (less_than_tab, less_than_tab, less_than_tab), re.M | re.X)
1163        return table_re.sub(self._table_sub, text)
1164
1165    def _wiki_table_sub(self, match):
1166        ttext = match.group(0).strip()
1167        # print('wiki table: %r' % match.group(0))
1168        rows = []
1169        for line in ttext.splitlines(0):
1170            line = line.strip()[2:-2].strip()
1171            row = [c.strip() for c in re.split(r'(?<!\\)\|\|', line)]
1172            rows.append(row)
1173        # from pprint import pprint
1174        # pprint(rows)
1175        hlines = []
1176
1177        def add_hline(line, indents=0):
1178            hlines.append((self.tab * indents) + line)
1179
1180        def format_cell(text):
1181            return self._run_span_gamut(re.sub(r"^\s*~", "", text).strip(" "))
1182
1183        add_hline('<table%s>' % self._html_class_str_from_tag('table'))
1184        # Check if first cell of first row is a header cell. If so, assume the whole row is a header row.
1185        if rows and rows[0] and re.match(r"^\s*~", rows[0][0]):
1186            add_hline('<thead>', 1)
1187            add_hline('<tr>', 2)
1188            for cell in rows[0]:
1189                add_hline("<th>{}</th>".format(format_cell(cell)), 3)
1190            add_hline('</tr>', 2)
1191            add_hline('</thead>', 1)
1192            # Only one header row allowed.
1193            rows = rows[1:]
1194        # If no more rows, don't create a tbody.
1195        if rows:
1196            add_hline('<tbody>', 1)
1197            for row in rows:
1198                add_hline('<tr>', 2)
1199                for cell in row:
1200                    add_hline('<td>{}</td>'.format(format_cell(cell)), 3)
1201                add_hline('</tr>', 2)
1202            add_hline('</tbody>', 1)
1203        add_hline('</table>')
1204        return '\n'.join(hlines) + '\n'
1205
1206    def _do_wiki_tables(self, text):
1207        # Optimization.
1208        if "||" not in text:
1209            return text
1210
1211        less_than_tab = self.tab_width - 1
1212        wiki_table_re = re.compile(r'''
1213            (?:(?<=\n\n)|\A\n?)            # leading blank line
1214            ^([ ]{0,%d})\|\|.+?\|\|[ ]*\n  # first line
1215            (^\1\|\|.+?\|\|\n)*        # any number of subsequent lines
1216            ''' % less_than_tab, re.M | re.X)
1217        return wiki_table_re.sub(self._wiki_table_sub, text)
1218
1219    def _run_span_gamut(self, text):
1220        # These are all the transformations that occur *within* block-level
1221        # tags like paragraphs, headers, and list items.
1222
1223        text = self._do_code_spans(text)
1224
1225        text = self._escape_special_chars(text)
1226
1227        # Process anchor and image tags.
1228        if "link-patterns" in self.extras:
1229            text = self._do_link_patterns(text)
1230
1231        text = self._do_links(text)
1232
1233        # Make links out of things like `<http://example.com/>`
1234        # Must come after _do_links(), because you can use < and >
1235        # delimiters in inline links like [this](<url>).
1236        text = self._do_auto_links(text)
1237
1238        text = self._encode_amps_and_angles(text)
1239
1240        if "strike" in self.extras:
1241            text = self._do_strike(text)
1242
1243        if "underline" in self.extras:
1244            text = self._do_underline(text)
1245
1246        text = self._do_italics_and_bold(text)
1247
1248        if "smarty-pants" in self.extras:
1249            text = self._do_smart_punctuation(text)
1250
1251        # Do hard breaks:
1252        if "break-on-newline" in self.extras:
1253            text = re.sub(r" *\n(?!\<(?:\/?(ul|ol|li))\>)", "<br%s\n" % self.empty_element_suffix, text)
1254        else:
1255            text = re.sub(r" {2,}\n", " <br%s\n" % self.empty_element_suffix, text)
1256
1257        return text
1258
1259    # "Sorta" because auto-links are identified as "tag" tokens.
1260    _sorta_html_tokenize_re = re.compile(r"""
1261        (
1262            # tag
1263            </?
1264            (?:\w+)                                     # tag name
1265            (?:\s+(?:[\w-]+:)?[\w-]+=(?:".*?"|'.*?'))*  # attributes
1266            \s*/?>
1267            |
1268            # auto-link (e.g., <http://www.activestate.com/>)
1269            <[\w~:/?#\[\]@!$&'\(\)*+,;%=\.\\-]+>
1270            |
1271            <!--.*?-->      # comment
1272            |
1273            <\?.*?\?>       # processing instruction
1274        )
1275        """, re.X)
1276
1277    def _escape_special_chars(self, text):
1278        # Python markdown note: the HTML tokenization here differs from
1279        # that in Markdown.pl, hence the behaviour for subtle cases can
1280        # differ (I believe the tokenizer here does a better job because
1281        # it isn't susceptible to unmatched '<' and '>' in HTML tags).
1282        # Note, however, that '>' is not allowed in an auto-link URL
1283        # here.
1284        escaped = []
1285        is_html_markup = False
1286        for token in self._sorta_html_tokenize_re.split(text):
1287            if is_html_markup:
1288                # Within tags/HTML-comments/auto-links, encode * and _
1289                # so they don't conflict with their use in Markdown for
1290                # italics and strong.  We're replacing each such
1291                # character with its corresponding MD5 checksum value;
1292                # this is likely overkill, but it should prevent us from
1293                # colliding with the escape values by accident.
1294                escaped.append(token.replace('*', self._escape_table['*'])
1295                               .replace('_', self._escape_table['_']))
1296            else:
1297                escaped.append(self._encode_backslash_escapes(token))
1298            is_html_markup = not is_html_markup
1299        return ''.join(escaped)
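The toggle in `_escape_special_chars` relies on a property of `re.split` with a capturing group: the result strictly alternates between non-matching text and captured matches, so a boolean flip on each token distinguishes markup from prose. A minimal illustration with a toy tag pattern (not the module's real tokenizer):

```python
import re

# One capture group, so re.split keeps the matched tags in the output.
tag_re = re.compile(r'(<[^>]+>)')

parts = tag_re.split('a <b class="*x*">c</b>')
# Even indices are plain text, odd indices are the captured tags:
print(parts)  # ['a ', '<b class="*x*">', 'c', '</b>', '']

inside = False
out = []
for token in parts:
    # Only touch '*' when inside markup, exactly like the loop above.
    out.append(token.replace('*', 'STAR') if inside else token)
    inside = not inside
print(''.join(out))  # 'a <b class="STARxSTAR">c</b>'
```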
1300
1301    def _hash_html_spans(self, text):
1302        # Used for safe_mode.
1303
1304        def _is_auto_link(s):
1305            if ':' in s and self._auto_link_re.match(s):
1306                return True
1307            elif '@' in s and self._auto_email_link_re.match(s):
1308                return True
1309            return False
1310
1311        def _is_code_span(index, token):
1312            try:
1313                if token == '<code>':
1314                    peek_tokens = split_tokens[index: index + 3]
1315                elif token == '</code>':
1316                    peek_tokens = split_tokens[index - 2: index + 1]
1317                else:
1318                    return False
1319            except IndexError:
1320                return False
1321
1322            return re.match(r'<code>md5-[A-Fa-f0-9]{32}</code>', ''.join(peek_tokens))
1323
1324        tokens = []
1325        split_tokens = self._sorta_html_tokenize_re.split(text)
1326        is_html_markup = False
1327        for index, token in enumerate(split_tokens):
1328            if is_html_markup and not _is_auto_link(token) and not _is_code_span(index, token):
1329                sanitized = self._sanitize_html(token)
1330                key = _hash_text(sanitized)
1331                self.html_spans[key] = sanitized
1332                tokens.append(key)
1333            else:
1334                tokens.append(self._encode_incomplete_tags(token))
1335            is_html_markup = not is_html_markup
1336        return ''.join(tokens)
1337
1338    def _unhash_html_spans(self, text):
1339        for key, sanitized in list(self.html_spans.items()):
1340            text = text.replace(key, sanitized)
1341        return text
1342
1343    def _sanitize_html(self, s):
1344        if self.safe_mode == "replace":
1345            return self.html_removed_text
1346        elif self.safe_mode == "escape":
1347            replacements = [
1348                ('&', '&amp;'),
1349                ('<', '&lt;'),
1350                ('>', '&gt;'),
1351            ]
1352            for before, after in replacements:
1353                s = s.replace(before, after)
1354            return s
1355        else:
1356            raise MarkdownError("invalid value for 'safe_mode': %r (must be "
1357                                "'escape' or 'replace')" % self.safe_mode)
1358
1359    _inline_link_title = re.compile(r'''
1360            (                   # \1
1361              [ \t]+
1362              (['"])            # quote char = \2
1363              (?P<title>.*?)
1364              \2
1365            )?                  # title is optional
1366          \)$
1367        ''', re.X | re.S)
1368    _tail_of_reference_link_re = re.compile(r'''
1369          # Match tail of: [text][id]
1370          [ ]?          # one optional space
1371          (?:\n[ ]*)?   # one optional newline followed by spaces
1372          \[
1373            (?P<id>.*?)
1374          \]
1375        ''', re.X | re.S)
1376
1377    _whitespace = re.compile(r'\s*')
1378
1379    _strip_anglebrackets = re.compile(r'<(.*)>.*')
1380
1381    def _find_non_whitespace(self, text, start):
1382        """Returns the index of the first non-whitespace character in text
1383        after (and including) start
1384        """
1385        match = self._whitespace.match(text, start)
1386        return match.end()
1387
1388    def _find_balanced(self, text, start, open_c, close_c):
1389        """Returns the index where the open_c and close_c characters balance
1390        out - the same number of open_c and close_c are encountered - or the
1391        end of string if it's reached before the balance point is found.
1392        """
1393        i = start
1394        l = len(text)
1395        count = 1
1396        while count > 0 and i < l:
1397            if text[i] == open_c:
1398                count += 1
1399            elif text[i] == close_c:
1400                count -= 1
1401            i += 1
1402        return i
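`_find_balanced` starts with `count = 1` because the caller has already consumed one opening delimiter before `start`; the returned index is one past the balancing close (or the end of the string if it never balances). A standalone sketch with the same contract:

```python
def find_balanced(text: str, start: int, open_c: str, close_c: str) -> int:
    # Assumes one unmatched open_c was already seen before `start`,
    # hence the initial count of 1.
    count = 1
    i = start
    while count > 0 and i < len(text):
        if text[i] == open_c:
            count += 1
        elif text[i] == close_c:
            count -= 1
        i += 1
    return i

s = '(a (b) c) tail'
end = find_balanced(s, 1, '(', ')')  # scan starts just after the first '('
print(s[1:end - 1])  # 'a (b) c'
```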
1403
1404    def _extract_url_and_title(self, text, start):
1405        """Extracts the url and (optional) title from the tail of a link"""
1406        # text[start] equals the opening parenthesis
1407        idx = self._find_non_whitespace(text, start + 1)
1408        if idx == len(text):
1409            return None, None, None
1410        end_idx = idx
1411        has_anglebrackets = text[idx] == "<"
1412        if has_anglebrackets:
1413            end_idx = self._find_balanced(text, end_idx + 1, "<", ">")
1414        end_idx = self._find_balanced(text, end_idx, "(", ")")
1415        match = self._inline_link_title.search(text, idx, end_idx)
1416        if not match:
1417            return None, None, None
1418        url, title = text[idx:match.start()], match.group("title")
1419        if has_anglebrackets:
1420            url = self._strip_anglebrackets.sub(r'\1', url)
1421        return url, title, end_idx
1422
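The `_strip_anglebrackets` substitution used above can also be seen in isolation; a small sketch with an illustrative URL:

```python
import re

# Copy of _strip_anglebrackets: keeps only what lies between the
# first '<' and the last '>' of an angle-bracketed link destination.
strip_anglebrackets = re.compile(r'<(.*)>.*')

def strip_angle(url):
    return strip_anglebrackets.sub(r'\1', url)
```

Destinations without angle brackets pass through unchanged, since the pattern simply does not match.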
1423    _safe_protocols = re.compile(r'(https?|ftp):', re.I)
1424
1425    def _do_links(self, text):
1426        """Turn Markdown link shortcuts into XHTML <a> and <img> tags.
1427
1428        This is a combination of Markdown.pl's _DoAnchors() and
1429        _DoImages(). They are done together because that simplifies the
1430        approach. A different approach than Markdown.pl's was necessary
1431        because Python's regex engine lacks the atomic matching that
1432        Markdown.pl's $g_nested_brackets relies on.
1433        """
1434        MAX_LINK_TEXT_SENTINEL = 3000  # markdown2 issue 24
1435
1436        # `anchor_allowed_pos` is used to support img links inside
1437        # anchors, but not anchors inside anchors. An anchor's start
1438        # pos must be `>= anchor_allowed_pos`.
1439        anchor_allowed_pos = 0
1440
1441        curr_pos = 0
1442        while True:  # Handle the next link.
1443            # The next '[' is the start of:
1444            # - an inline anchor:   [text](url "title")
1445            # - a reference anchor: [text][id]
1446            # - an inline img:      ![text](url "title")
1447            # - a reference img:    ![text][id]
1448            # - a footnote ref:     [^id]
1449            #   (Only if 'footnotes' extra enabled)
1450            # - a footnote defn:    [^id]: ...
1451            #   (Only if 'footnotes' extra enabled) These have already
1452            #   been stripped in _strip_footnote_definitions() so no
1453            #   need to watch for them.
1454            # - a link definition:  [id]: url "title"
1455            #   These have already been stripped in
1456            #   _strip_link_definitions() so no need to watch for them.
1457            # - not markup:         [...anything else...
1458            try:
1459                start_idx = text.index('[', curr_pos)
1460            except ValueError:
1461                break
1462            text_length = len(text)
1463
1464            # Find the matching closing ']'.
1465            # Markdown.pl allows *matching* brackets in link text so we
1466            # will here too. Markdown.pl *doesn't* currently allow
1467            # matching brackets in img alt text -- we'll differ in that
1468            # regard.
1469            bracket_depth = 0
1470            for p in range(start_idx + 1, min(start_idx + MAX_LINK_TEXT_SENTINEL,
1471                                              text_length)):
1472                ch = text[p]
1473                if ch == ']':
1474                    bracket_depth -= 1
1475                    if bracket_depth < 0:
1476                        break
1477                elif ch == '[':
1478                    bracket_depth += 1
1479            else:
1480                # Closing bracket not found within sentinel length.
1481                # This isn't markup.
1482                curr_pos = start_idx + 1
1483                continue
1484            link_text = text[start_idx + 1:p]
1485
1486            # Fix for issue 341 - Injecting XSS into link text
1487            if self.safe_mode:
1488                link_text = self._hash_html_spans(link_text)
1489                link_text = self._unhash_html_spans(link_text)
1490
1491            # Possibly a footnote ref?
1492            if "footnotes" in self.extras and link_text.startswith("^"):
1493                normed_id = re.sub(r'\W', '-', link_text[1:])
1494                if normed_id in self.footnotes:
1495                    self.footnote_ids.append(normed_id)
1496                    result = '<sup class="footnote-ref" id="fnref-%s">' \
1497                             '<a href="#fn-%s">%s</a></sup>' \
1498                             % (normed_id, normed_id, len(self.footnote_ids))
1499                    text = text[:start_idx] + result + text[p + 1:]
1500                else:
1501                    # This id isn't defined, leave the markup alone.
1502                    curr_pos = p + 1
1503                continue
1504
1505            # Now determine what this is by the remainder.
1506            p += 1
1507
1508            # Inline anchor or img?
1509            if text[p:p + 1] == '(':  # attempt at perf improvement
1510                url, title, url_end_idx = self._extract_url_and_title(text, p)
1511                if url is not None:
1512                    # Handle an inline anchor or img.
1513                    is_img = start_idx > 0 and text[start_idx - 1] == "!"
1514                    if is_img:
1515                        start_idx -= 1
1516
1517                    # We've got to encode these to avoid conflicting
1518                    # with italics/bold.
1519                    url = url.replace('*', self._escape_table['*']) \
1520                        .replace('_', self._escape_table['_'])
1521                    if title:
1522                        title_str = ' title="%s"' % (
1523                            _xml_escape_attr(title)
1524                            .replace('*', self._escape_table['*'])
1525                            .replace('_', self._escape_table['_']))
1526                    else:
1527                        title_str = ''
1528                    if is_img:
1529                        img_class_str = self._html_class_str_from_tag("img")
1530                        result = '<img src="%s" alt="%s"%s%s%s' \
1531                                 % (_html_escape_url(url, safe_mode=self.safe_mode),
1532                                    _xml_escape_attr(link_text),
1533                                    title_str,
1534                                    img_class_str,
1535                                    self.empty_element_suffix)
1536                        if "smarty-pants" in self.extras:
1537                            result = result.replace('"', self._escape_table['"'])
1538                        curr_pos = start_idx + len(result)
1539                        text = text[:start_idx] + result + text[url_end_idx:]
1540                    elif start_idx >= anchor_allowed_pos:
1541                        safe_link = self._safe_protocols.match(url) or url.startswith('#')
1542                        if self.safe_mode and not safe_link:
1543                            result_head = '<a href="#"%s>' % (title_str)
1544                        else:
1545                            result_head = '<a href="%s"%s>' % (
1546                            _html_escape_url(url, safe_mode=self.safe_mode), title_str)
1547                        result = '%s%s</a>' % (result_head, link_text)
1548                        if "smarty-pants" in self.extras:
1549                            result = result.replace('"', self._escape_table['"'])
1550                        # <img> allowed from curr_pos on, <a> from
1551                        # anchor_allowed_pos on.
1552                        curr_pos = start_idx + len(result_head)
1553                        anchor_allowed_pos = start_idx + len(result)
1554                        text = text[:start_idx] + result + text[url_end_idx:]
1555                    else:
1556                        # Anchor not allowed here.
1557                        curr_pos = start_idx + 1
1558                    continue
1559
1560            # Reference anchor or img?
1561            else:
1562                match = self._tail_of_reference_link_re.match(text, p)
1563                if match:
1564                    # Handle a reference-style anchor or img.
1565                    is_img = start_idx > 0 and text[start_idx - 1] == "!"
1566                    if is_img:
1567                        start_idx -= 1
1568                    link_id = match.group("id").lower()
1569                    if not link_id:
1570                        link_id = link_text.lower()  # for links like [this][]
1571                    if link_id in self.urls:
1572                        url = self.urls[link_id]
1573                        # We've got to encode these to avoid conflicting
1574                        # with italics/bold.
1575                        url = url.replace('*', self._escape_table['*']) \
1576                            .replace('_', self._escape_table['_'])
1577                        title = self.titles.get(link_id)
1578                        if title:
1579                            title = _xml_escape_attr(title) \
1580                                .replace('*', self._escape_table['*']) \
1581                                .replace('_', self._escape_table['_'])
1582                            title_str = ' title="%s"' % title
1583                        else:
1584                            title_str = ''
1585                        if is_img:
1586                            img_class_str = self._html_class_str_from_tag("img")
1587                            result = '<img src="%s" alt="%s"%s%s%s' \
1588                                     % (_html_escape_url(url, safe_mode=self.safe_mode),
1589                                        _xml_escape_attr(link_text),
1590                                        title_str,
1591                                        img_class_str,
1592                                        self.empty_element_suffix)
1593                            if "smarty-pants" in self.extras:
1594                                result = result.replace('"', self._escape_table['"'])
1595                            curr_pos = start_idx + len(result)
1596                            text = text[:start_idx] + result + text[match.end():]
1597                        elif start_idx >= anchor_allowed_pos:
1598                            if self.safe_mode and not self._safe_protocols.match(url):
1599                                result_head = '<a href="#"%s>' % (title_str)
1600                            else:
1601                                result_head = '<a href="%s"%s>' % (
1602                                _html_escape_url(url, safe_mode=self.safe_mode), title_str)
1603                            result = '%s%s</a>' % (result_head, link_text)
1604                            if "smarty-pants" in self.extras:
1605                                result = result.replace('"', self._escape_table['"'])
1606                            # <img> allowed from curr_pos on, <a> from
1607                            # anchor_allowed_pos on.
1608                            curr_pos = start_idx + len(result_head)
1609                            anchor_allowed_pos = start_idx + len(result)
1610                            text = text[:start_idx] + result + text[match.end():]
1611                        else:
1612                            # Anchor not allowed here.
1613                            curr_pos = start_idx + 1
1614                    else:
1615                        # This id isn't defined, leave the markup alone.
1616                        curr_pos = match.end()
1617                    continue
1618
1619            # Otherwise, it isn't markup.
1620            curr_pos = start_idx + 1
1621
1622        return text
1623
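The safe_mode gate applied to hrefs above can be distilled into a small sketch (the function name `safe_href` is hypothetical):

```python
import re

# Copy of _safe_protocols plus the gate from _do_links: in safe mode,
# only http(s)/ftp URLs or in-page fragments keep their destination;
# anything else (e.g. javascript:) is neutralised to "#".
safe_protocols = re.compile(r'(https?|ftp):', re.I)

def safe_href(url, safe_mode=True):
    if not safe_mode or safe_protocols.match(url) or url.startswith('#'):
        return url
    return '#'
```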
1624    def header_id_from_text(self, text, prefix, n):
1625        """Generate a header id attribute value from the given header
1626        HTML content.
1627
1628        This is only called if the "header-ids" extra is enabled.
1629        Subclasses may override this for different header ids.
1630
1631        @param text {str} The text of the header tag
1632        @param prefix {str} The requested prefix for header ids. This is the
1633            value of the "header-ids" extra key, if any. Otherwise, None.
1634        @param n {int} The <hN> tag number, i.e. `1` for an <h1> tag.
1635        @returns {str} The value for the header tag's "id" attribute. Return
1636            None to not have an id attribute and to exclude this header from
1637            the TOC (if the "toc" extra is specified).
1638        """
1639        header_id = _slugify(text)
1640        if prefix and isinstance(prefix, str):
1641            header_id = prefix + '-' + header_id
1642
1643        self._count_from_header_id[header_id] += 1
1644        if 0 == len(header_id) or self._count_from_header_id[header_id] > 1:
1645            header_id += '-%s' % self._count_from_header_id[header_id]
1646
1647        return header_id
1648
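A standalone sketch of the deduplication scheme above (with a crude stand-in for markdown2's `_slugify`; all names here are illustrative):

```python
import re
from collections import defaultdict

def slugify(value):
    # crude stand-in for markdown2's _slugify
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

counts = defaultdict(int)

def header_id(text, prefix=None):
    # slugify, apply the optional prefix, then suffix a counter for
    # repeated (or empty) ids -- mirroring header_id_from_text.
    hid = slugify(text)
    if prefix:
        hid = prefix + '-' + hid
    counts[hid] += 1
    if not hid or counts[hid] > 1:
        hid += '-%s' % counts[hid]
    return hid
```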
1649    def _toc_add_entry(self, level, id, name):
1650        if level > self._toc_depth:
1651            return
1652        if self._toc is None:
1653            self._toc = []
1654        self._toc.append((level, id, self._unescape_special_chars(name)))
1655
1656    _h_re_base = r'''
1657        (^(.+)[ \t]{0,99}\n(=+|-+)[ \t]*\n+)
1658        |
1659        (^(\#{1,6})  # \5 = string of #'s
1660        [ \t]%s
1661        (.+?)       # \6 = Header text
1662        [ \t]{0,99}
1663        (?<!\\)     # ensure not an escaped trailing '#'
1664        \#*         # optional closing #'s (not counted)
1665        \n+
1666        )
1667        '''
1668
1669    _h_re = re.compile(_h_re_base % '*', re.X | re.M)
1670    _h_re_tag_friendly = re.compile(_h_re_base % '+', re.X | re.M)
1671
1672    def _h_sub(self, match):
1673        if match.group(1) is not None and match.group(3) == "-":
1674            return match.group(1)
1675        elif match.group(1) is not None:
1676            # Setext header
1677            n = {"=": 1, "-": 2}[match.group(3)[0]]
1678            header_group = match.group(2)
1679        else:
1680            # atx header
1681            n = len(match.group(5))
1682            header_group = match.group(6)
1683
1684        demote_headers = self.extras.get("demote-headers")
1685        if demote_headers:
1686            n = min(n + demote_headers, 6)
1687        header_id_attr = ""
1688        if "header-ids" in self.extras:
1689            header_id = self.header_id_from_text(header_group,
1690                                                 self.extras["header-ids"], n)
1691            if header_id:
1692                header_id_attr = ' id="%s"' % header_id
1693        html = self._run_span_gamut(header_group)
1694        if "toc" in self.extras and header_id:
1695            self._toc_add_entry(n, header_id, html)
1696        return "<h%d%s>%s</h%d>\n\n" % (n, header_id_attr, html, n)
1697
1698    def _do_headers(self, text):
1699        # Setext-style headers:
1700        #     Header 1
1701        #     ========
1702        #
1703        #     Header 2
1704        #     --------
1705
1706        # atx-style headers:
1707        #   # Header 1
1708        #   ## Header 2
1709        #   ## Header 2 with closing hashes ##
1710        #   ...
1711        #   ###### Header 6
1712
1713        if 'tag-friendly' in self.extras:
1714            return self._h_re_tag_friendly.sub(self._h_sub, text)
1715        return self._h_re.sub(self._h_sub, text)
1716
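The effect of the `%s` slot in `_h_re_base` can be demonstrated with a reduced, atx-only copy of the pattern: the tag-friendly variant requires at least one space after the `#`'s, so lines like `#include <stdio.h>` are left alone:

```python
import re

atx_base = r'''
    ^(\#{1,6})   # \1 = string of #'s
    [ \t]%s      # '*' normally, '+' for tag-friendly
    (.+?)        # \2 = header text
    [ \t]*
    (?<!\\)
    \#*
    \n+
'''
atx_re = re.compile(atx_base % '*', re.X | re.M)
atx_re_tag_friendly = re.compile(atx_base % '+', re.X | re.M)
```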
1717    _marker_ul_chars = '*+-'
1718    _marker_any = r'(?:[%s]|\d+\.)' % _marker_ul_chars
1719    _marker_ul = '(?:[%s])' % _marker_ul_chars
1720    _marker_ol = r'(?:\d+\.)'
1721
1722    def _list_sub(self, match):
1723        lst = match.group(1)
1724        lst_type = "ul" if match.group(3) in self._marker_ul_chars else "ol"
1725        result = self._process_list_items(lst)
1726        if self.list_level:
1727            return "<%s>\n%s</%s>\n" % (lst_type, result, lst_type)
1728        else:
1729            return "<%s>\n%s</%s>\n\n" % (lst_type, result, lst_type)
1730
1731    def _do_lists(self, text):
1732        # Form HTML ordered (numbered) and unordered (bulleted) lists.
1733
1734        # Iterate over each *non-overlapping* list match.
1735        pos = 0
1736        while True:
1737            # Find the *first* hit for either list style (ul or ol). We
1738            # match ul and ol separately to avoid adjacent lists of different
1739            # types running into each other (see issue #16).
1740            hits = []
1741            for marker_pat in (self._marker_ul, self._marker_ol):
1742                less_than_tab = self.tab_width - 1
1743                whole_list = r'''
1744                    (                   # \1 = whole list
1745                      (                 # \2
1746                        [ ]{0,%d}
1747                        (%s)            # \3 = first list item marker
1748                        [ \t]+
1749                        (?!\ *\3\ )     # '- - - ...' isn't a list. See 'not_quite_a_list' test case.
1750                      )
1751                      (?:.+?)
1752                      (                 # \4
1753                          \Z
1754                        |
1755                          \n{2,}
1756                          (?=\S)
1757                          (?!           # Negative lookahead for another list item marker
1758                            [ \t]*
1759                            %s[ \t]+
1760                          )
1761                      )
1762                    )
1763                ''' % (less_than_tab, marker_pat, marker_pat)
1764                if self.list_level:  # sub-list
1765                    list_re = re.compile("^" + whole_list, re.X | re.M | re.S)
1766                else:
1767                    list_re = re.compile(r"(?:(?<=\n\n)|\A\n?)" + whole_list,
1768                                         re.X | re.M | re.S)
1769                match = list_re.search(text, pos)
1770                if match:
1771                    hits.append((match.start(), match))
1772            if not hits:
1773                break
1774            hits.sort()
1775            match = hits[0][1]
1776            start, end = match.span()
1777            middle = self._list_sub(match)
1778            text = text[:start] + middle + text[end:]
1779            pos = start + len(middle)  # start pos for next attempted match
1780
1781        return text
1782
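The scanning strategy above (find the earliest hit among several patterns, substitute it, resume just past the replacement) generalises; a minimal sketch with hypothetical names:

```python
import re

def sub_earliest(text, patterns, repl):
    # Repeatedly find the *earliest* match among `patterns`,
    # substitute it, and resume scanning just past the replacement
    # so matches never overlap (mirrors the ul/ol loop above).
    pos = 0
    while True:
        hits = [m for m in (p.search(text, pos) for p in patterns) if m]
        if not hits:
            break
        m = min(hits, key=lambda h: h.start())
        middle = repl(m)
        text = text[:m.start()] + middle + text[m.end():]
        pos = m.start() + len(middle)
    return text
```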
1783    _list_item_re = re.compile(r'''
1784        (\n)?                   # leading line = \1
1785        (^[ \t]*)               # leading whitespace = \2
1786        (?P<marker>%s) [ \t]+   # list marker = \3
1787        ((?:.+?)                # list item text = \4
1788        (\n{1,2}))              # eols = \5
1789        (?= \n* (\Z | \2 (?P<next_marker>%s) [ \t]+))
1790        ''' % (_marker_any, _marker_any),
1791                               re.M | re.X | re.S)
1792
1793    _task_list_item_re = re.compile(r'''
1794        (\[[\ xX]\])[ \t]+       # tasklist marker = \1
1795        (.*)                   # list item text = \2
1796    ''', re.M | re.X | re.S)
1797
1798    _task_list_wrapper_str = r'<input type="checkbox" class="task-list-item-checkbox" %sdisabled> %s'
1799
1800    def _task_list_item_sub(self, match):
1801        marker = match.group(1)
1802        item_text = match.group(2)
1803        if marker in ['[x]', '[X]']:
1804            return self._task_list_wrapper_str % ('checked ', item_text)
1805        elif marker == '[ ]':
1806            return self._task_list_wrapper_str % ('', item_text)
1807
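The task-list substitution above reduces to a simple mapping; a sketch:

```python
wrapper = ('<input type="checkbox" class="task-list-item-checkbox" '
           '%sdisabled> %s')

def task_item(marker, text):
    # '[x]' / '[X]' render as a checked, disabled checkbox;
    # '[ ]' as an unchecked one.
    return wrapper % ('checked ' if marker in ('[x]', '[X]') else '', text)
```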
1808    _last_li_endswith_two_eols = False
1809
1810    def _list_item_sub(self, match):
1811        item = match.group(4)
1812        leading_line = match.group(1)
1813        if leading_line or "\n\n" in item or self._last_li_endswith_two_eols:
1814            item = self._run_block_gamut(self._outdent(item))
1815        else:
1816            # Recursion for sub-lists:
1817            item = self._do_lists(self._uniform_outdent(item, min_outdent=' ')[1])
1818            if item.endswith('\n'):
1819                item = item[:-1]
1820            item = self._run_span_gamut(item)
1821        self._last_li_endswith_two_eols = (len(match.group(5)) == 2)
1822
1823        if "task_list" in self.extras:
1824            item = self._task_list_item_re.sub(self._task_list_item_sub, item)
1825
1826        return "<li>%s</li>\n" % item
1827
1828    def _process_list_items(self, list_str):
1829        # Process the contents of a single ordered or unordered list,
1830        # splitting it into individual list items.
1831
1832        # The $g_list_level global keeps track of when we're inside a list.
1833        # Each time we enter a list, we increment it; when we leave a list,
1834        # we decrement. If it's zero, we're not in a list anymore.
1835        #
1836        # We do this because when we're not inside a list, we want to treat
1837        # something like this:
1838        #
1839        #       I recommend upgrading to version
1840        #       8. Oops, now this line is treated
1841        #       as a sub-list.
1842        #
1843        # As a single paragraph, despite the fact that the second line starts
1844        # with a digit-period-space sequence.
1845        #
1846        # Whereas when we're inside a list (or sub-list), that line will be
1847        # treated as the start of a sub-list. What a kludge, huh? This is
1848        # an aspect of Markdown's syntax that's hard to parse perfectly
1849        # without resorting to mind-reading. Perhaps the solution is to
1850        # change the syntax rules such that sub-lists must start with a
1851        # starting cardinal number; e.g. "1." or "a.".
1852        self.list_level += 1
1853        self._last_li_endswith_two_eols = False
1854        list_str = list_str.rstrip('\n') + '\n'
1855        list_str = self._list_item_re.sub(self._list_item_sub, list_str)
1856        self.list_level -= 1
1857        return list_str
1858
1859    def _get_pygments_lexer(self, lexer_name):
1860        try:
1861            from pygments import lexers, util
1862        except ImportError:
1863            return None
1864        try:
1865            return lexers.get_lexer_by_name(lexer_name)
1866        except util.ClassNotFound:
1867            return None
1868
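The guarded-import pattern above treats a missing optional dependency the same as an unknown lexer name: both mean "no highlighting", never an error. In isolation:

```python
def get_lexer(name):
    # Returns a Pygments lexer, or None if Pygments is not installed
    # or the name is unknown -- callers then skip highlighting.
    try:
        from pygments import lexers, util
    except ImportError:
        return None
    try:
        return lexers.get_lexer_by_name(name)
    except util.ClassNotFound:
        return None
```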
1869    def _color_with_pygments(self, codeblock, lexer, **formatter_opts):
1870        import pygments
1871        import pygments.formatters
1872
1873        class HtmlCodeFormatter(pygments.formatters.HtmlFormatter):
1874            def _wrap_code(self, inner):
1875                """A function for use in a Pygments Formatter which
1876                wraps in <code> tags.
1877                """
1878                yield 0, "<code>"
1879                for tup in inner:
1880                    yield tup
1881                yield 0, "</code>"
1882
1883            def _add_newline(self, inner):
1884                # Add newlines around the inner contents so that _strict_tag_block_re matches the outer div.
1885                yield 0, "\n"
1886                yield from inner
1887                yield 0, "\n"
1888
1889            def wrap(self, source, outfile=None):
1890                """Return the source with a code, pre, and div."""
1891                if outfile is None:
1892                    # pygments >= 2.12
1893                    return self._add_newline(self._wrap_pre(self._wrap_code(source)))
1894                else:
1895                    # pygments < 2.12
1896                    return self._wrap_div(self._add_newline(self._wrap_pre(self._wrap_code(source))))
1897
1898        formatter_opts.setdefault("cssclass", "codehilite")
1899        formatter = HtmlCodeFormatter(**formatter_opts)
1900        return pygments.highlight(codeblock, lexer, formatter)
1901
1902    def _code_block_sub(self, match, is_fenced_code_block=False):
1903        lexer_name = None
1904        if is_fenced_code_block:
1905            lexer_name = match.group(2)
1906            codeblock = match.group(3)
1907            codeblock = codeblock[:-1]  # drop one trailing newline
1908        else:
1909            codeblock = match.group(1)
1910            codeblock = self._outdent(codeblock)
1911            codeblock = self._detab(codeblock)
1912            codeblock = codeblock.lstrip('\n')  # trim leading newlines
1913            codeblock = codeblock.rstrip()  # trim trailing whitespace
1914
1915            # Note: "code-color" extra is DEPRECATED.
1916            if "code-color" in self.extras and codeblock.startswith(":::"):
1917                lexer_name, rest = codeblock.split('\n', 1)
1918                lexer_name = lexer_name[3:].strip()
1919                codeblock = rest.lstrip("\n")  # Remove lexer declaration line.
1920
1921        # Use pygments only if not using the highlightjs-lang extra
1922        if lexer_name and "highlightjs-lang" not in self.extras:
1923            lexer = self._get_pygments_lexer(lexer_name)
1924            if lexer:
1925                leading_indent = ' ' * (len(match.group(1)) - len(match.group(1).lstrip()))
1926                return self._code_block_with_lexer_sub(codeblock, leading_indent, lexer, is_fenced_code_block)
1927
1928        pre_class_str = self._html_class_str_from_tag("pre")
1929
1930        if "highlightjs-lang" in self.extras and lexer_name:
1931            code_class_str = ' class="%s language-%s"' % (lexer_name, lexer_name)
1932        else:
1933            code_class_str = self._html_class_str_from_tag("code")
1934
1935        if is_fenced_code_block:
1936            # Fenced code blocks need to be outdented before encoding, and the indent reapplied afterwards
1937            leading_indent = ' ' * (len(match.group(1)) - len(match.group(1).lstrip()))
1938            leading_indent, codeblock = self._uniform_outdent_limit(codeblock, leading_indent)
1939
1940            codeblock = self._encode_code(codeblock)
1941
1942            return "\n%s<pre%s><code%s>%s\n</code></pre>\n" % (
1943                leading_indent, pre_class_str, code_class_str, codeblock)
1944        else:
1945            codeblock = self._encode_code(codeblock)
1946
1947            return "\n<pre%s><code%s>%s\n</code></pre>\n" % (
1948                pre_class_str, code_class_str, codeblock)
1949
1950    def _code_block_with_lexer_sub(self, codeblock, leading_indent, lexer, is_fenced_code_block):
1951        if is_fenced_code_block:
1952            formatter_opts = self.extras['fenced-code-blocks'] or {}
1953        else:
1954            formatter_opts = self.extras['code-color'] or {}
1955
1956        def unhash_code(codeblock):
1957            for key, sanitized in list(self.html_spans.items()):
1958                codeblock = codeblock.replace(key, sanitized)
1959            replacements = [
1960                ("&amp;", "&"),
1961                ("&lt;", "<"),
1962                ("&gt;", ">")
1963            ]
1964            for old, new in replacements:
1965                codeblock = codeblock.replace(old, new)
1966            return codeblock
1967
1968        # remove leading indent from code block
1969        leading_indent, codeblock = self._uniform_outdent(codeblock)
1970
1971        codeblock = unhash_code(codeblock)
1972        colored = self._color_with_pygments(codeblock, lexer,
1973                                            **formatter_opts)
1974
1975        # add back the indent to all lines
1976        return "\n%s\n" % self._uniform_indent(colored, leading_indent, True)
1977
1978    def _html_class_str_from_tag(self, tag):
1979        """Get the appropriate ' class="..."' string (note the leading
1980        space), if any, for the given tag.
1981        """
1982        if "html-classes" not in self.extras:
1983            return ""
1984        try:
1985            html_classes_from_tag = self.extras["html-classes"]
1986        except TypeError:
1987            return ""
1988        else:
1989            if isinstance(html_classes_from_tag, dict):
1990                if tag in html_classes_from_tag:
1991                    return ' class="%s"' % html_classes_from_tag[tag]
1992        return ""
1993
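The "html-classes" extra is a mapping from tag name to class string; the lookup above amounts to the following sketch (name hypothetical):

```python
def html_class_str(extras, tag):
    # ' class="..."' (note the leading space) when the "html-classes"
    # extra maps this tag; otherwise the empty string.
    mapping = extras.get("html-classes")
    if isinstance(mapping, dict) and tag in mapping:
        return ' class="%s"' % mapping[tag]
    return ''
```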
1994    def _do_code_blocks(self, text):
1995        """Process Markdown `<pre><code>` blocks."""
1996        code_block_re = re.compile(r'''
1997            (?:\n\n|\A\n?)
1998            (               # $1 = the code block -- one or more lines, starting with a space/tab
1999              (?:
2000                (?:[ ]{%d} | \t)  # Lines must start with a tab or a tab-width of spaces
2001                .*\n+
2002              )+
2003            )
2004            ((?=^[ ]{0,%d}\S)|\Z)   # Lookahead for non-space at line-start, or end of doc
2005            # Lookahead to make sure this block isn't already in a code block.
2006            # Needed when syntax highlighting is being used.
2007            (?!([^<]|<(/?)span)*\</code\>)
2008            ''' % (self.tab_width, self.tab_width),
2009                                   re.M | re.X)
2010        return code_block_re.sub(self._code_block_sub, text)
2011
2012    _fenced_code_block_re = re.compile(r'''
2013        (?:\n+|\A\n?|(?<=\n))
2014        (^[ \t]*`{3,})\s{0,99}?([\w+-]+)?\s{0,99}?\n  # $1 = opening fence (captured for back-referencing), $2 = optional lang
2015        (.*?)                             # $3 = code block content
2016        \1[ \t]*\n                      # closing fence
2017        ''', re.M | re.X | re.S)
2018
2019    def _fenced_code_block_sub(self, match):
2020        return self._code_block_sub(match, is_fenced_code_block=True)
2021
2022    def _do_fenced_code_blocks(self, text):
2023        """Process ```-fenced unindented code blocks ('fenced-code-blocks' extra)."""
2024        return self._fenced_code_block_re.sub(self._fenced_code_block_sub, text)
2025
2026    # Rules for a code span:
2027    # - backslash escapes are not interpreted in a code span
2028    # - to include a backtick or a run of backticks, the delimiters
2029    #   must be a longer run of backticks
2030    # - cannot start or end a code span with a backtick; pad with a
2031    #   space and that space will be removed in the emitted HTML
2032    # See `test/tm-cases/escapes.text` for a number of edge-case
2033    # examples.
2034    _code_span_re = re.compile(r'''
2035            (?<!\\)
2036            (`+)        # \1 = Opening run of `
2037            (?!`)       # See Note A in test/tm-cases/escapes.text
2038            (.+?)       # \2 = The code block
2039            (?<!`)
2040            \1          # Matching closer
2041            (?!`)
2042        ''', re.X | re.S)
2043
2044    def _code_span_sub(self, match):
2045        c = match.group(2).strip(" \t")
2046        c = self._encode_code(c)
2047        return "<code%s>%s</code>" % (self._html_class_str_from_tag("code"), c)
2048
2049    def _do_code_spans(self, text):
2050        #   *   Backtick quotes are used for <code></code> spans.
2051        #
2052        #   *   You can use multiple backticks as the delimiters if you want to
2053        #       include literal backticks in the code span. So, this input:
2054        #
2055        #         Just type ``foo `bar` baz`` at the prompt.
2056        #
2057        #       Will translate to:
2058        #
2059        #         <p>Just type <code>foo `bar` baz</code> at the prompt.</p>
2060        #
2061        #       There's no arbitrary limit to the number of backticks you
2062        #       can use as delimiters. If you need three consecutive backticks
2063        #       in your code, use four for delimiters, etc.
2064        #
2065        #   *   You can use spaces to get literal backticks at the edges:
2066        #
2067        #         ... type `` `bar` `` ...
2068        #
2069        #       Turns to:
2070        #
2071        #         ... type <code>`bar`</code> ...
2072        return self._code_span_re.sub(self._code_span_sub, text)
2073
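The rules in the comments above can be exercised with a standalone copy of `_code_span_re` (the output tag is simplified here, with no class attribute):

```python
import re

code_span_re = re.compile(r'''
        (?<!\\)
        (`+)        # \1 = opening run of `
        (?!`)
        (.+?)       # \2 = the code span
        (?<!`)
        \1          # matching closer
        (?!`)
    ''', re.X | re.S)

def code_spans(text):
    # Edge-padding spaces are stripped, so `` `bar` `` yields
    # <code>`bar`</code>.
    return code_span_re.sub(
        lambda m: '<code>%s</code>' % m.group(2).strip(' \t'), text)
```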
2074    def _encode_code(self, text):
2075        """Encode/escape certain characters inside Markdown code runs.
2076        The point is that in code, these characters are literals,
2077        and lose their special Markdown meanings.
2078        """
2079        replacements = [
2080            # Encode all ampersands; HTML entities are not
2081            # entities within a Markdown code span.
2082            ('&', '&amp;'),
2083            # Do the angle bracket song and dance:
2084            ('<', '&lt;'),
2085            ('>', '&gt;'),
2086        ]
2087        for before, after in replacements:
2088            text = text.replace(before, after)
2089        hashed = _hash_text(text)
2090        self._code_table[text] = hashed
2091        return hashed
2092
2093    _admonitions = r'admonition|attention|caution|danger|error|hint|important|note|tip|warning'
2094    _admonitions_re = re.compile(r'''
2095        ^(\ *)\.\.\ (%s)::\ *                # $1 leading indent, $2 the admonition
2096        (.*)?                                # $3 admonition title
2097        ((?:\s*\n\1\ {3,}.*)+?)              # $4 admonition body (required)
2098        (?=\s*(?:\Z|\n{4,}|\n\1?\ {0,2}\S))  # until EOF, 3 blank lines or something less indented
2099        ''' % _admonitions,
2100                                 re.IGNORECASE | re.MULTILINE | re.VERBOSE
2101                                 )
2102
2103    def _do_admonitions_sub(self, match):
2104        lead_indent, admonition_name, title, body = match.groups()
2105
2106        admonition_type = '<strong>%s</strong>' % admonition_name
2107
2108        # figure out the class names to assign the block
2109        if admonition_name.lower() == 'admonition':
2110            admonition_class = 'admonition'
2111        else:
2112            admonition_class = 'admonition %s' % admonition_name.lower()
2113
2114        # titles are generally optional
2115        if title:
2116            title = '<em>%s</em>' % title
2117
2118        # process the admonition body like regular markdown
2119        body = self._run_block_gamut("\n%s\n" % self._uniform_outdent(body)[1])
2120
2121        # indent the body before placing inside the aside block
2122        admonition = self._uniform_indent('%s\n%s\n\n%s\n' % (admonition_type, title, body), self.tab, False)
2123        # wrap it in an aside
2124        admonition = '<aside class="%s">\n%s</aside>' % (admonition_class, admonition)
2125        # now indent the whole admonition back to where it started
2126        return self._uniform_indent(admonition, lead_indent, False)
2127
2128    def _do_admonitions(self, text):
2129        return self._admonitions_re.sub(self._do_admonitions_sub, text)
2130
2131    _strike_re = re.compile(r"~~(?=\S)(.+?)(?<=\S)~~", re.S)
2132
2133    def _do_strike(self, text):
2134        text = self._strike_re.sub(r"<s>\1</s>", text)
2135        return text
2136
2137    _underline_re = re.compile(r"(?<!<!)--(?!>)(?=\S)(.+?)(?<=\S)(?<!<!)--(?!>)", re.S)
2138
2139    def _do_underline(self, text):
2140        text = self._underline_re.sub(r"<u>\1</u>", text)
2141        return text
2142
2143    _strong_re = re.compile(r"(\*\*|__)(?=\S)(.+?[*_]*)(?<=\S)\1", re.S)
2144    _em_re = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.S)
2145    _code_friendly_strong_re = re.compile(r"\*\*(?=\S)(.+?[*_]*)(?<=\S)\*\*", re.S)
2146    _code_friendly_em_re = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*", re.S)
2147
2148    def _do_italics_and_bold(self, text):
2149        # <strong> must go first:
2150        if "code-friendly" in self.extras:
2151            text = self._code_friendly_strong_re.sub(r"<strong>\1</strong>", text)
2152            text = self._code_friendly_em_re.sub(r"<em>\1</em>", text)
2153        else:
2154            text = self._strong_re.sub(r"<strong>\2</strong>", text)
2155            text = self._em_re.sub(r"<em>\2</em>", text)
2156        return text
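The ordering constraint in the comment above ("&lt;strong&gt; must go first") can be sketched standalone with the same two patterns; if the single-asterisk pattern ran first, `**bold**` would be misparsed as emphasis. The function name `italics_and_bold` is illustrative:

```python
import re

# Same patterns as _strong_re / _em_re above.
strong_re = re.compile(r"(\*\*|__)(?=\S)(.+?[*_]*)(?<=\S)\1", re.S)
em_re = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.S)

def italics_and_bold(text):
    # Substitute <strong> first so "**" is never consumed as two "*" runs.
    text = strong_re.sub(r"<strong>\2</strong>", text)
    text = em_re.sub(r"<em>\2</em>", text)
    return text

html = italics_and_bold("**boom!** and *boo!*")
```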
2157
2158    # "smarty-pants" extra: Very liberal in interpreting a single prime as an
2159    # apostrophe; e.g. ignores the fact that "round", "bout", "twer", and
2160    # "twixt" can be written without an initial apostrophe. This is fine because
2161    # using scare quotes (single quotation marks) is rare.
2162    _apostrophe_year_re = re.compile(r"'(\d\d)(?=(\s|,|;|\.|\?|!|$))")
2163    _contractions = ["tis", "twas", "twer", "neath", "o", "n",
2164                     "round", "bout", "twixt", "nuff", "fraid", "sup"]
2165
2166    def _do_smart_contractions(self, text):
2167        text = self._apostrophe_year_re.sub(r"&#8217;\1", text)
2168        for c in self._contractions:
2169            text = text.replace("'%s" % c, "&#8217;%s" % c)
2170            text = text.replace("'%s" % c.capitalize(),
2171                                "&#8217;%s" % c.capitalize())
2172        return text
2173
2174    # Substitute double-quotes before single-quotes.
2175    _opening_single_quote_re = re.compile(r"(?<!\S)'(?=\S)")
2176    _opening_double_quote_re = re.compile(r'(?<!\S)"(?=\S)')
2177    _closing_single_quote_re = re.compile(r"(?<=\S)'")
2178    _closing_double_quote_re = re.compile(r'(?<=\S)"(?=(\s|,|;|\.|\?|!|$))')
2179
2180    def _do_smart_punctuation(self, text):
2181        """Fancifies 'single quotes', "double quotes", and apostrophes.
2182        Converts --, ---, and ... into en dashes, em dashes, and ellipses.
2183
2184        Inspiration is: <http://daringfireball.net/projects/smartypants/>
2185        See "test/tm-cases/smarty_pants.text" for a full discussion of the
2186        support here and
2187        <http://code.google.com/p/python-markdown2/issues/detail?id=42> for a
2188        discussion of some diversion from the original SmartyPants.
2189        """
2190        if "'" in text:  # guard for perf
2191            text = self._do_smart_contractions(text)
2192            text = self._opening_single_quote_re.sub("&#8216;", text)
2193            text = self._closing_single_quote_re.sub("&#8217;", text)
2194
2195        if '"' in text:  # guard for perf
2196            text = self._opening_double_quote_re.sub("&#8220;", text)
2197            text = self._closing_double_quote_re.sub("&#8221;", text)
2198
2199        text = text.replace("---", "&#8212;")
2200        text = text.replace("--", "&#8211;")
2201        text = text.replace("...", "&#8230;")
2202        text = text.replace(" . . . ", "&#8230;")
2203        text = text.replace(". . .", "&#8230;")
2204
2205        # TODO: Temporary hack to fix https://github.com/trentm/python-markdown2/issues/150
2206        if "footnotes" in self.extras and "footnote-ref" in text:
2207            # Quotes in the footnote back ref get converted to "smart" quotes
2208            # Change them back here to ensure they work.
2209            text = text.replace('class="footnote-ref&#8221;', 'class="footnote-ref"')
2210
2211        return text
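The dash replacements above are order-sensitive: `---` must be handled before `--`, or every em dash would be consumed as an en dash plus a stray hyphen. A minimal sketch (the name `smart_dashes` is illustrative):

```python
def smart_dashes(text):
    # "---" must be replaced before "--", otherwise an em dash would be
    # emitted as an en dash followed by a leftover hyphen.
    text = text.replace("---", "&#8212;")  # em dash
    text = text.replace("--", "&#8211;")   # en dash
    return text

out = smart_dashes("pp. 12--14 --- roughly")
```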
2212
2213    _block_quote_base = r'''
2214        (                           # Wrap whole match in \1
2215          (
2216            ^[ \t]*>%s[ \t]?        # '>' at the start of a line
2217              .+\n                  # rest of the first line
2218            (.+\n)*                 # subsequent consecutive lines
2219          )+
2220        )
2221    '''
2222    _block_quote_re = re.compile(_block_quote_base % '', re.M | re.X)
2223    _block_quote_re_spoiler = re.compile(_block_quote_base % '[ \t]*?!?', re.M | re.X)
2224    _bq_one_level_re = re.compile('^[ \t]*>[ \t]?', re.M)
2225    _bq_one_level_re_spoiler = re.compile('^[ \t]*>[ \t]*?![ \t]?', re.M)
2226    _bq_all_lines_spoilers = re.compile(r'\A(?:^[ \t]*>[ \t]*?!.*[\n\r]*)+\Z', re.M)
2227    _html_pre_block_re = re.compile(r'(\s*<pre>.+?</pre>)', re.S)
2228
2229    def _dedent_two_spaces_sub(self, match):
2230        return re.sub(r'(?m)^  ', '', match.group(1))
2231
2232    def _block_quote_sub(self, match):
2233        bq = match.group(1)
2234        is_spoiler = 'spoiler' in self.extras and self._bq_all_lines_spoilers.match(bq)
2235        # trim one level of quoting
2236        if is_spoiler:
2237            bq = self._bq_one_level_re_spoiler.sub('', bq)
2238        else:
2239            bq = self._bq_one_level_re.sub('', bq)
2240        # trim whitespace-only lines
2241        bq = self._ws_only_line_re.sub('', bq)
2242        bq = self._run_block_gamut(bq)  # recurse
2243
2244        bq = re.sub('(?m)^', '  ', bq)
2245        # These leading spaces screw with <pre> content, so we need to fix that:
2246        bq = self._html_pre_block_re.sub(self._dedent_two_spaces_sub, bq)
2247
2248        if is_spoiler:
2249            return '<blockquote class="spoiler">\n%s\n</blockquote>\n\n' % bq
2250        else:
2251            return '<blockquote>\n%s\n</blockquote>\n\n' % bq
2252
2253    def _do_block_quotes(self, text):
2254        if '>' not in text:
2255            return text
2256        if 'spoiler' in self.extras:
2257            return self._block_quote_re_spoiler.sub(self._block_quote_sub, text)
2258        else:
2259            return self._block_quote_re.sub(self._block_quote_sub, text)
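The per-recursion "trim one level of quoting" step above can be shown in isolation with the same pattern as `_bq_one_level_re`; each substitution pass removes exactly one leading `>` per line, leaving deeper nesting for the recursive `_run_block_gamut` call:

```python
import re

# One level of '>' quoting is stripped per pass, mirroring _bq_one_level_re.
bq_one_level_re = re.compile(r'^[ \t]*>[ \t]?', re.M)

inner = bq_one_level_re.sub('', "> line one\n> > nested\n")
```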
2260
2261    def _form_paragraphs(self, text):
2262        # Strip leading and trailing lines:
2263        text = text.strip('\n')
2264
2265        # Wrap <p> tags.
2266        grafs = []
2267        for i, graf in enumerate(re.split(r"\n{2,}", text)):
2268            if graf in self.html_blocks:
2269                # Unhashify HTML blocks
2270                grafs.append(self.html_blocks[graf])
2271            else:
2272                cuddled_list = None
2273                if "cuddled-lists" in self.extras:
2274                    # Need to put back trailing '\n' for `_list_item_re`
2275                    # match at the end of the paragraph.
2276                    li = self._list_item_re.search(graf + '\n')
2277                    # Two of the same list marker in this paragraph: a likely
2278                    # candidate for a list cuddled to preceding paragraph
2279                    # text (issue 33). Note the `[-1]` is a quick way to
2280                    # consider numeric bullets (e.g. "1." and "2.") to be
2281                    # equal.
2282                    if (li and len(li.group(2)) <= 3
2283                        and (
2284                            (li.group("next_marker") and li.group("marker")[-1] == li.group("next_marker")[-1])
2285                            or
2286                            li.group("next_marker") is None
2287                        )
2288                    ):
2289                        start = li.start()
2290                        cuddled_list = self._do_lists(graf[start:]).rstrip("\n")
2291                        assert cuddled_list.startswith("<ul>") or cuddled_list.startswith("<ol>")
2292                        graf = graf[:start]
2293
2294                # Wrap <p> tags.
2295                graf = self._run_span_gamut(graf)
2296                grafs.append("<p%s>" % self._html_class_str_from_tag('p') + graf.lstrip(" \t") + "</p>")
2297
2298                if cuddled_list:
2299                    grafs.append(cuddled_list)
2300
2301        return "\n\n".join(grafs)
2302
2303    def _add_footnotes(self, text):
2304        if self.footnotes:
2305            footer = [
2306                '<div class="footnotes">',
2307                '<hr' + self.empty_element_suffix,
2308                '<ol>',
2309            ]
2310
2311            if not self.footnote_title:
2312                self.footnote_title = "Jump back to footnote %d in the text."
2313            if not self.footnote_return_symbol:
2314                self.footnote_return_symbol = "&#8617;"
2315
2316            for i, id in enumerate(self.footnote_ids):
2317                if i != 0:
2318                    footer.append('')
2319                footer.append('<li id="fn-%s">' % id)
2320                footer.append(self._run_block_gamut(self.footnotes[id]))
2321                try:
2322                    backlink = ('<a href="#fnref-%s" ' +
2323                                'class="footnoteBackLink" ' +
2324                                'title="' + self.footnote_title + '">' +
2325                                self.footnote_return_symbol +
2326                                '</a>') % (id, i + 1)
2327                except TypeError:
2328                    log.debug("Footnote error. `footnote_title` "
2329                              "must include a '%d' parameter. Using defaults.")
2330                    backlink = ('<a href="#fnref-%s" '
2331                                'class="footnoteBackLink" '
2332                                'title="Jump back to footnote %d in the text.">'
2333                                '&#8617;</a>' % (id, i + 1))
2334
2335                if footer[-1].endswith("</p>"):
2336                    footer[-1] = footer[-1][:-len("</p>")] \
2337                                 + '&#160;' + backlink + "</p>"
2338                else:
2339                    footer.append("\n<p>%s</p>" % backlink)
2340                footer.append('</li>')
2341            footer.append('</ol>')
2342            footer.append('</div>')
2343            return text + '\n\n' + '\n'.join(footer)
2344        else:
2345            return text
2346
2347    _naked_lt_re = re.compile(r'<(?![a-z/?\$!])', re.I)
2348    _naked_gt_re = re.compile(r'''(?<![a-z0-9?!/'"-])>''', re.I)
2349
2350    def _encode_amps_and_angles(self, text):
2351        # Smart processing for ampersands and angle brackets that need
2352        # to be encoded.
2353        text = _AMPERSAND_RE.sub('&amp;', text)
2354
2355        # Encode naked <'s
2356        text = self._naked_lt_re.sub('&lt;', text)
2357
2358        # Encode naked >'s
2359        # Note: Other markdown implementations (e.g. Markdown.pl, PHP
2360        # Markdown) don't do this.
2361        text = self._naked_gt_re.sub('&gt;', text)
2362        return text
2363
2364    _incomplete_tags_re = re.compile(r"<(/?\w+?(?!\w)\s*?.+?[\s/]+?)")
2365
2366    def _encode_incomplete_tags(self, text):
2367        if self.safe_mode not in ("replace", "escape"):
2368            return text
2369
2370        if text.endswith(">"):
2371            return text  # this is not an incomplete tag, this is a link in the form <http://x.y.z>
2372
2373        return self._incomplete_tags_re.sub("&lt;\\1", text)
2374
2375    def _encode_backslash_escapes(self, text):
2376        for ch, escape in list(self._escape_table.items()):
2377            text = text.replace("\\" + ch, escape)
2378        return text
2379
2380    _auto_link_re = re.compile(r'<((https?|ftp):[^\'">\s]+)>', re.I)
2381
2382    def _auto_link_sub(self, match):
2383        g1 = match.group(1)
2384        return '<a href="%s">%s</a>' % (g1, g1)
2385
2386    _auto_email_link_re = re.compile(r"""
2387          <
2388           (?:mailto:)?
2389          (
2390              [-.\w]+
2391              \@
2392              [-\w]+(\.[-\w]+)*\.[a-z]+
2393          )
2394          >
2395        """, re.I | re.X | re.U)
2396
2397    def _auto_email_link_sub(self, match):
2398        return self._encode_email_address(
2399            self._unescape_special_chars(match.group(1)))
2400
2401    def _do_auto_links(self, text):
2402        text = self._auto_link_re.sub(self._auto_link_sub, text)
2403        text = self._auto_email_link_re.sub(self._auto_email_link_sub, text)
2404        return text
2405
2406    def _encode_email_address(self, addr):
2407        #  Input: an email address, e.g. "foo@example.com"
2408        #
2409        #  Output: the email address as a mailto link, with each character
2410        #      of the address encoded as either a decimal or hex entity, in
2411        #      the hopes of foiling most address harvesting spam bots. E.g.:
2412        #
2413        #    <a href="&#x6D;&#97;&#105;&#108;&#x74;&#111;:&#102;&#111;&#111;&#64;&#101;
2414        #       x&#x61;&#109;&#x70;&#108;&#x65;&#x2E;&#99;&#111;&#109;">&#102;&#111;&#111;
2415        #       &#64;&#101;x&#x61;&#109;&#x70;&#108;&#x65;&#x2E;&#99;&#111;&#109;</a>
2416        #
2417        #  Based on a filter by Matthew Wickline, posted to the BBEdit-Talk
2418        #  mailing list: <http://tinyurl.com/yu7ue>
2419        chars = [_xml_encode_email_char_at_random(ch)
2420                 for ch in "mailto:" + addr]
2421        # Strip the mailto: from the visible part.
2422        addr = '<a href="%s">%s</a>' \
2423               % (''.join(chars), ''.join(chars[7:]))
2424        return addr
2425
2426    _basic_link_re = re.compile(r'!?\[.*?\]\(.*?\)')
2427
2428    def _do_link_patterns(self, text):
2429        link_from_hash = {}
2430        for regex, repl in self.link_patterns:
2431            replacements = []
2432            for match in regex.finditer(text):
2433                if hasattr(repl, "__call__"):
2434                    href = repl(match)
2435                else:
2436                    href = match.expand(repl)
2437                replacements.append((match.span(), href))
2438            for (start, end), href in reversed(replacements):
2439
2440                # Do not match against links inside brackets.
2441                if text[start - 1:start] == '[' and text[end:end + 1] == ']':
2442                    continue
2443
2444                # Do not match against links in the standard markdown syntax.
2445                if text[start - 2:start] == '](' or text[end:end + 2] == '")':
2446                    continue
2447
2448                # Do not match against links which are escaped.
2449                if text[start - 3:start] == '"""' and text[end:end + 3] == '"""':
2450                    text = text[:start - 3] + text[start:end] + text[end + 3:]
2451                    continue
2452
2453                # search the text for anything that looks like a link
2454                is_inside_link = False
2455                for link_re in (self._auto_link_re, self._basic_link_re):
2456                    for match in link_re.finditer(text):
2457                        if any((r[0] <= start and end <= r[1]) for r in match.regs):
2458                            # if the link pattern start and end pos is within the bounds of
2459                            # something that looks like a link, then don't process it
2460                            is_inside_link = True
2461                            break
2462                    else:
2463                        continue
2464                    break
2465
2466                if is_inside_link:
2467                    continue
2468
2469                escaped_href = (
2470                    href.replace('"', '&quot;')  # b/c of attr quote
2471                    # To avoid markdown <em> and <strong>:
2472                    .replace('*', self._escape_table['*'])
2473                    .replace('_', self._escape_table['_']))
2474                link = '<a href="%s">%s</a>' % (escaped_href, text[start:end])
2475                hash = _hash_text(link)
2476                link_from_hash[hash] = link
2477                text = text[:start] + hash + text[end:]
2478        for hash, link in list(link_from_hash.items()):
2479            text = text.replace(hash, link)
2480        return text
2481
2482    def _unescape_special_chars(self, text):
2483        # Swap back in all the special characters we've hidden.
2484        for ch, hash in list(self._escape_table.items()) + list(self._code_table.items()):
2485            text = text.replace(hash, ch)
2486        return text
2487
2488    def _outdent(self, text):
2489        # Remove one level of line-leading tabs or spaces
2490        return self._outdent_re.sub('', text)
2491
2492    def _uniform_outdent(self, text, min_outdent=None):
2493        # Removes the smallest common leading indentation from each line
2494        # of `text` and returns said indent along with the outdented text.
2495        # The `min_outdent` kwarg only outdents lines that start with at
2496        # least this level of indentation or more.
2497
2498        # Find leading indentation of each line
2499        ws = re.findall(r'(^[ \t]*)(?:[^ \t\n])', text, re.MULTILINE)
2500        # Keep only the indents at least as deep as `min_outdent`
2501        if min_outdent:
2502            # don't use "is not None" here so we avoid iterating over ws
2503            # if min_outdent == '', which would do nothing
2504            ws = [i for i in ws if len(min_outdent) <= len(i)]
2505        if not ws:
2506            return '', text
2507        # Get smallest common leading indent
2508        ws = sorted(ws)[0]
2509        # Dedent every line by smallest common indent
2510        return ws, ''.join(
2511            (line.replace(ws, '', 1) if line.startswith(ws) else line)
2512            for line in text.splitlines(True)
2513        )
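The core trick above — taking the lexicographically smallest leading-whitespace run as the common indent — can be sketched standalone (no `min_outdent` handling; the name `uniform_outdent` is illustrative):

```python
import re

def uniform_outdent(text):
    # Collect the leading whitespace of every non-blank line; when
    # indentation is consistent, the lexicographically smallest run is a
    # prefix of all the others, i.e. the common indent.
    ws = re.findall(r'(^[ \t]*)(?:[^ \t\n])', text, re.MULTILINE)
    if not ws:
        return '', text
    indent = sorted(ws)[0]
    return indent, ''.join(
        line.replace(indent, '', 1) if line.startswith(indent) else line
        for line in text.splitlines(True)
    )

indent, outdented = uniform_outdent("    a\n        b\n    c\n")
```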
2514
2515    def _uniform_outdent_limit(self, text, outdent):
2516        # Outdents up to `outdent`. Similar to `_uniform_outdent`, but
2517        # will leave some indentation on the line with the smallest common
2518        # leading indentation depending on the amount specified.
2519        # If the smallest leading indentation is less than `outdent`, it will
2520        # perform identical to `_uniform_outdent`
2521
2522        # Find leading indentation of each line
2523        ws = re.findall(r'(^[ \t]*)(?:[^ \t\n])', text, re.MULTILINE)
2524        if not ws:
2525            return outdent, text
2526        # Get smallest common leading indent
2527        ws = sorted(ws)[0]
2528        if len(outdent) > len(ws):
2529            outdent = ws
2530        return outdent, ''.join(
2531            (line.replace(outdent, '', 1) if line.startswith(outdent) else line)
2532            for line in text.splitlines(True)
2533        )
2534
2535    def _uniform_indent(self, text, indent, include_empty_lines=False):
2536        return ''.join(
2537            (indent + line if line.strip() or include_empty_lines else '')
2538            for line in text.splitlines(True)
2539        )
2540
2541
2542class MarkdownWithExtras(Markdown):
2543    """A markdowner class that enables most extras:
2544
2545    - footnotes
2546    - code-color (only has effect if 'pygments' Python module on path)
2547
2548    These are not included:
2549    - pyshell (specific to Python-related documenting)
2550    - code-friendly (because it *disables* part of the syntax)
2551    - link-patterns (because you need to specify some actual
2552      link-patterns anyway)
2553    """
2554    extras = ["footnotes", "code-color"]
2555
2556
2557# ---- internal support functions
2558
2559
2560def calculate_toc_html(toc):
2561    """Return the HTML for the given TOC, a list of (level, id, name) tuples.
2562
2563    Returns None if `toc` is None.
2564    """
2565    if toc is None:
2566        return None
2567
2568    def indent():
2569        return '  ' * (len(h_stack) - 1)
2570
2571    lines = []
2572    h_stack = [0]  # stack of header-level numbers
2573    for level, id, name in toc:
2574        if level > h_stack[-1]:
2575            lines.append("%s<ul>" % indent())
2576            h_stack.append(level)
2577        elif level == h_stack[-1]:
2578            lines[-1] += "</li>"
2579        else:
2580            while level < h_stack[-1]:
2581                h_stack.pop()
2582                if not lines[-1].endswith("</li>"):
2583                    lines[-1] += "</li>"
2584                lines.append("%s</ul></li>" % indent())
2585        lines.append('%s<li><a href="#%s">%s</a>' % (
2586            indent(), id, name))
2587    while len(h_stack) > 1:
2588        h_stack.pop()
2589        if not lines[-1].endswith("</li>"):
2590            lines[-1] += "</li>"
2591        lines.append("%s</ul>" % indent())
2592    return '\n'.join(lines) + '\n'
2593
2594
2595class UnicodeWithAttrs(str):
2596    """A subclass of str used for the return value of conversion, so that
2597    attributes can be attached to it. E.g. the "toc_html" attribute when
2598    the "toc" extra is used.
2599    """
2600    metadata = None
2601    toc_html = None
2602
2603
2604## {{{ http://code.activestate.com/recipes/577257/ (r1)
2605_slugify_strip_re = re.compile(r'[^\w\s-]')
2606_slugify_hyphenate_re = re.compile(r'[-\s]+')
2607
2608
2609def _slugify(value):
2610    """
2611    Normalizes string, converts to lowercase, removes non-alpha characters,
2612    and converts spaces to hyphens.
2613
2614    From Django's "django/template/defaultfilters.py".
2615    """
2616    import unicodedata
2617    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode()
2618    value = _slugify_strip_re.sub('', value).strip().lower()
2619    return _slugify_hyphenate_re.sub('-', value)
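The slug pipeline above (ASCII-fold, drop punctuation, lowercase, collapse whitespace and hyphen runs) can be reproduced as a self-contained sketch using the same two regexes:

```python
import re
import unicodedata

_strip_re = re.compile(r'[^\w\s-]')
_hyphenate_re = re.compile(r'[-\s]+')

def slugify(value):
    # Same steps as _slugify above: NFKD-fold to ASCII, drop punctuation,
    # lowercase, then collapse whitespace/hyphen runs to single hyphens.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode()
    value = _strip_re.sub('', value).strip().lower()
    return _hyphenate_re.sub('-', value)
```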
2620
2621
2622## end of http://code.activestate.com/recipes/577257/ }}}
2623
2624
2625# From http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52549
2626def _curry(*args, **kwargs):
2627    function, args = args[0], args[1:]
2628
2629    def result(*rest, **kwrest):
2630        combined = kwargs.copy()
2631        combined.update(kwrest)
2632        return function(*args + rest, **combined)
2633
2634    return result
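A minimal usage sketch of the `_curry` pattern above, pre-binding positional and keyword arguments in the spirit of `functools.partial` (the `tag` helper is purely illustrative):

```python
def curry(*args, **kwargs):
    # Mirrors _curry above: bind some args now, merge the rest at call time.
    function, args = args[0], args[1:]

    def result(*rest, **kwrest):
        combined = kwargs.copy()
        combined.update(kwrest)
        return function(*args + rest, **combined)

    return result

def tag(name, text, cls=None):
    attr = ' class="%s"' % cls if cls else ''
    return '<%s%s>%s</%s>' % (name, attr, text, name)

em = curry(tag, 'em', cls='x')
```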
2635
2636
2637# Recipe: regex_from_encoded_pattern (1.0)
2638def _regex_from_encoded_pattern(s):
2639    """'foo'    -> re.compile(re.escape('foo'))
2640       '/foo/'  -> re.compile('foo')
2641       '/foo/i' -> re.compile('foo', re.I)
2642    """
2643    if s.startswith('/') and s.rfind('/') != 0:
2644        # Parse it: /PATTERN/FLAGS
2645        idx = s.rfind('/')
2646        _, flags_str = s[1:idx], s[idx + 1:]
2647        flag_from_char = {
2648            "i": re.IGNORECASE,
2649            "l": re.LOCALE,
2650            "s": re.DOTALL,
2651            "m": re.MULTILINE,
2652            "u": re.UNICODE,
2653        }
2654        flags = 0
2655        for char in flags_str:
2656            try:
2657                flags |= flag_from_char[char]
2658            except KeyError:
2659                raise ValueError("unsupported regex flag: '%s' in '%s' "
2660                                 "(must be one of '%s')"
2661                                 % (char, s, ''.join(list(flag_from_char.keys()))))
2662        return re.compile(s[1:idx], flags)
2663    else:  # not an encoded regex
2664        return re.compile(re.escape(s))
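The `/PATTERN/FLAGS` convention parsed above can be exercised with a trimmed-down mirror of the function (fewer supported flags; the name `regex_from_encoded_pattern` is reused for illustration only):

```python
import re

def regex_from_encoded_pattern(s):
    # '/PATTERN/FLAGS' becomes a compiled regex; anything else is treated
    # as a literal string and escaped.
    if s.startswith('/') and s.rfind('/') != 0:
        idx = s.rfind('/')
        flag_from_char = {"i": re.IGNORECASE, "s": re.DOTALL, "m": re.MULTILINE}
        flags = 0
        for char in s[idx + 1:]:
            flags |= flag_from_char[char]
        return re.compile(s[1:idx], flags)
    return re.compile(re.escape(s))
```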
2665
2666
2667# Recipe: dedent (0.1.2)
2668def _dedentlines(lines, tabsize=8, skip_first_line=False):
2669    """_dedentlines(lines, tabsize=8, skip_first_line=False) -> dedented lines
2670
2671        "lines" is a list of lines to dedent.
2672        "tabsize" is the tab width to use for indent width calculations.
2673        "skip_first_line" is a boolean indicating if the first line should
2674            be skipped for calculating the indent width and for dedenting.
2675            This is sometimes useful for docstrings and similar.
2676
2677    Same as dedent() except operates on a sequence of lines. Note: the
2678    lines list is modified **in-place**.
2679    """
2680    DEBUG = False
2681    if DEBUG:
2682        print("dedent: dedent(..., tabsize=%d, skip_first_line=%r)" \
2683              % (tabsize, skip_first_line))
2684    margin = None
2685    for i, line in enumerate(lines):
2686        if i == 0 and skip_first_line: continue
2687        indent = 0
2688        for ch in line:
2689            if ch == ' ':
2690                indent += 1
2691            elif ch == '\t':
2692                indent += tabsize - (indent % tabsize)
2693            elif ch in '\r\n':
2694                continue  # skip all-whitespace lines
2695            else:
2696                break
2697        else:
2698            continue  # skip all-whitespace lines
2699        if DEBUG: print("dedent: indent=%d: %r" % (indent, line))
2700        if margin is None:
2701            margin = indent
2702        else:
2703            margin = min(margin, indent)
2704    if DEBUG: print("dedent: margin=%r" % margin)
2705
2706    if margin is not None and margin > 0:
2707        for i, line in enumerate(lines):
2708            if i == 0 and skip_first_line: continue
2709            removed = 0
2710            for j, ch in enumerate(line):
2711                if ch == ' ':
2712                    removed += 1
2713                elif ch == '\t':
2714                    removed += tabsize - (removed % tabsize)
2715                elif ch in '\r\n':
2716                    if DEBUG: print("dedent: %r: EOL -> strip up to EOL" % line)
2717                    lines[i] = lines[i][j:]
2718                    break
2719                else:
2720                    raise ValueError("unexpected non-whitespace char %r in "
2721                                     "line %r while removing %d-space margin"
2722                                     % (ch, line, margin))
2723                if DEBUG:
2724                    print("dedent: %r: %r -> removed %d/%d" \
2725                          % (line, ch, removed, margin))
2726                if removed == margin:
2727                    lines[i] = lines[i][j + 1:]
2728                    break
2729                elif removed > margin:
2730                    lines[i] = ' ' * (removed - margin) + lines[i][j + 1:]
2731                    break
2732            else:
2733                if removed:
2734                    lines[i] = lines[i][removed:]
2735    return lines
2736
2737
2738def _dedent(text, tabsize=8, skip_first_line=False):
2739    """_dedent(text, tabsize=8, skip_first_line=False) -> dedented text
2740
2741        "text" is the text to dedent.
2742        "tabsize" is the tab width to use for indent width calculations.
2743        "skip_first_line" is a boolean indicating if the first line should
2744            be skipped for calculating the indent width and for dedenting.
2745            This is sometimes useful for docstrings and similar.
2746
2747    textwrap.dedent(s), but don't expand tabs to spaces
2748    """
2749    lines = text.splitlines(True)
2750    _dedentlines(lines, tabsize=tabsize, skip_first_line=skip_first_line)
2751    return ''.join(lines)
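The tab-aware width rule used by `_dedentlines` above — a tab advances the indent to the next `tabsize` stop rather than counting as one column — isolated into a small helper (the name `indent_width` is illustrative):

```python
def indent_width(line, tabsize=8):
    # Mirrors the width calculation in _dedentlines: a space adds one
    # column, a tab advances to the next multiple of tabsize.
    indent = 0
    for ch in line:
        if ch == ' ':
            indent += 1
        elif ch == '\t':
            indent += tabsize - (indent % tabsize)
        else:
            break
    return indent
```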
2752
2753
2754class _memoized(object):
2755    """Decorator that caches a function's return value each time it is called.
2756    If called later with the same arguments, the cached value is returned, and
2757    not re-evaluated.
2758
2759    http://wiki.python.org/moin/PythonDecoratorLibrary
2760    """
2761
2762    def __init__(self, func):
2763        self.func = func
2764        self.cache = {}
2765
2766    def __call__(self, *args):
2767        try:
2768            return self.cache[args]
2769        except KeyError:
2770            self.cache[args] = value = self.func(*args)
2771            return value
2772        except TypeError:
2773            # uncachable -- for instance, passing a list as an argument.
2774            # Better to not cache than to blow up entirely.
2775            return self.func(*args)
2776
2777    def __repr__(self):
2778        """Return the function's docstring."""
2779        return self.func.__doc__
2780
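The caching behavior of `_memoized` above can be made observable with a minimal clone plus a call counter; the second lookup hits the cache and the wrapped function never runs again:

```python
# A minimal clone of the _memoized pattern above, with a call counter.
class memoized:
    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        try:
            return self.cache[args]
        except KeyError:
            self.cache[args] = value = self.func(*args)
            return value
        except TypeError:
            # Unhashable args (e.g. a list): skip caching rather than fail.
            return self.func(*args)

calls = []

@memoized
def square(n):
    calls.append(n)
    return n * n
```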
2781
2782def _xml_oneliner_re_from_tab_width(tab_width):
2783    """Standalone XML processing instruction regex."""
2784    return re.compile(r"""
2785        (?:
2786            (?<=\n\n)       # Starting after a blank line
2787            |               # or
2788            \A\n?           # the beginning of the doc
2789        )
2790        (                           # save in $1
2791            [ ]{0,%d}
2792            (?:
2793                <\?\w+\b\s+.*?\?>   # XML processing instruction
2794                |
2795                <\w+:\w+\b\s+.*?/>  # namespaced single tag
2796            )
2797            [ \t]*
2798            (?=\n{2,}|\Z)       # followed by a blank line or end of document
2799        )
2800        """ % (tab_width - 1), re.X)
2801
2802
2803_xml_oneliner_re_from_tab_width = _memoized(_xml_oneliner_re_from_tab_width)
2804
2805
2806def _hr_tag_re_from_tab_width(tab_width):
2807    return re.compile(r"""
2808        (?:
2809            (?<=\n\n)       # Starting after a blank line
2810            |               # or
2811            \A\n?           # the beginning of the doc
2812        )
2813        (                       # save in \1
2814            [ ]{0,%d}
2815            <(hr)               # start tag = \2
2816            \b                  # word break
2817            ([^<>])*?           # attributes
2818            /?>                 # the matching end tag
2819            [ \t]*
2820            (?=\n{2,}|\Z)       # followed by a blank line or end of document
2821        )
2822        """ % (tab_width - 1), re.X)
2823
2824
2825_hr_tag_re_from_tab_width = _memoized(_hr_tag_re_from_tab_width)
2826
2827
2828def _xml_escape_attr(attr, skip_single_quote=True):
2829    """Escape the given string for use in an HTML/XML tag attribute.
2830
2831    By default this doesn't bother with escaping `'` to `&#39;`, presuming that
2832    the tag attribute is surrounded by double quotes.
2833    """
2834    escaped = _AMPERSAND_RE.sub('&amp;', attr)
2835
2836    escaped = (escaped
2837               .replace('"', '&quot;')
2838               .replace('<', '&lt;')
2839               .replace('>', '&gt;'))
2840    if not skip_single_quote:
2841        escaped = escaped.replace("'", "&#39;")
2842    return escaped
2843
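Behaviour sketch of `_xml_escape_attr`: ampersands are escaped first (skipping anything that already looks like an entity), then quotes and angle brackets; single quotes are only escaped on request. A self-contained approximation — the `_AMPERSAND_RE` here is assumed to match the module's definition:

```python
import re

# Assumed to mirror the module's _AMPERSAND_RE: escape '&' unless it
# already starts an entity like '&amp;' or '&#39;'.
_AMPERSAND_RE = re.compile(r'&(?!#?\w+;)')

def xml_escape_attr(attr, skip_single_quote=True):
    """Escape a string for use inside a double-quoted XML/HTML attribute."""
    escaped = _AMPERSAND_RE.sub('&amp;', attr)
    escaped = (escaped
               .replace('"', '&quot;')
               .replace('<', '&lt;')
               .replace('>', '&gt;'))
    if not skip_single_quote:
        escaped = escaped.replace("'", "&#39;")
    return escaped
```

Note the order matters: escaping `&` after the other replacements would double-escape the entities they produce.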
2844
2845def _xml_encode_email_char_at_random(ch):
2846    r = random()
2847    # Roughly 10% raw, 45% hex, 45% dec.
2848    # '@' *must* be encoded. I [John Gruber] insist.
2849    # Issue 26: '_' must be encoded.
2850    if r > 0.9 and ch not in "@_":
2851        return ch
2852    elif r < 0.45:
2853        # The [1:] is to drop leading '0': 0x63 -> x63
2854        return '&#%s;' % hex(ord(ch))[1:]
2855    else:
2856        return '&#%s;' % ord(ch)
2857
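`_xml_encode_email_char_at_random` obfuscates email addresses by mixing raw characters with hex and decimal character entities, while guaranteeing that `@` and `_` are always encoded. A standalone copy to illustrate — the exact output varies run to run, but always decodes back to the original address:

```python
import html
from random import random

def xml_encode_email_char_at_random(ch):
    r = random()
    # Roughly 10% raw, 45% hex entity, 45% decimal entity;
    # '@' and '_' are never left raw.
    if r > 0.9 and ch not in "@_":
        return ch
    elif r < 0.45:
        # hex(ord('@')) == '0x40'; [1:] drops the leading '0' -> 'x40'
        return '&#%s;' % hex(ord(ch))[1:]
    else:
        return '&#%s;' % ord(ch)

encoded = ''.join(xml_encode_email_char_at_random(ch) for ch in "a@b.com")
assert '@' not in encoded                  # '@' is always an entity
assert html.unescape(encoded) == "a@b.com"
```

Browsers render the entities normally, so the address stays readable to humans while being harder for naive scrapers to harvest.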
2858
2859def _html_escape_url(attr, safe_mode=False):
2860    """Replace special characters that are potentially malicious in a URL string."""
2861    escaped = (attr
2862               .replace('"', '&quot;')
2863               .replace('<', '&lt;')
2864               .replace('>', '&gt;'))
2865    if safe_mode:
2866        escaped = escaped.replace('+', ' ')
2867        escaped = escaped.replace("'", "&#39;")
2868    return escaped
2869
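A behaviour sketch of `_html_escape_url`: quotes and angle brackets are always neutralized so the URL cannot break out of its attribute, and safe mode additionally strips `+` and escapes single quotes. Self-contained copy for illustration:

```python
def html_escape_url(attr, safe_mode=False):
    """Escape characters in a URL that could break out of an HTML attribute."""
    escaped = (attr
               .replace('"', '&quot;')
               .replace('<', '&lt;')
               .replace('>', '&gt;'))
    if safe_mode:
        escaped = escaped.replace('+', ' ')
        escaped = escaped.replace("'", "&#39;")
    return escaped
```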
2870
2871# ---- mainline
2872
2873class _NoReflowFormatter(optparse.IndentedHelpFormatter):
2874    """An optparse formatter that does NOT reflow the description."""
2875
2876    def format_description(self, description):
2877        return description or ""
2878
2879
2880def _test():
2881    import doctest
2882    doctest.testmod()
2883
2884
2885def main(argv=None):
2886    if argv is None:
2887        argv = sys.argv
2888    if not logging.root.handlers:
2889        logging.basicConfig()
2890
2891    usage = "usage: %prog [PATHS...]"
2892    version = "%prog " + __version__
2893    parser = optparse.OptionParser(prog="markdown2", usage=usage,
2894                                   version=version, description=cmdln_desc,
2895                                   formatter=_NoReflowFormatter())
2896    parser.add_option("-v", "--verbose", dest="log_level",
2897                      action="store_const", const=logging.DEBUG,
2898                      help="more verbose output")
2899    parser.add_option("--encoding",
2900                      help="specify encoding of text content")
2901    parser.add_option("--html4tags", action="store_true", default=False,
2902                      help="use HTML 4 style for empty element tags")
2903    parser.add_option("-s", "--safe", metavar="MODE", dest="safe_mode",
2904                      help="sanitize literal HTML: 'escape' escapes "
2905                           "HTML meta chars, 'replace' replaces with an "
2906                           "[HTML_REMOVED] note")
2907    parser.add_option("-x", "--extras", action="append",
2908                      help="Turn on specific extra features (not part of "
2909                           "the core Markdown spec). See above.")
2910    parser.add_option("--use-file-vars",
2911                      help="Look for and use Emacs-style 'markdown-extras' "
2912                           "file var to turn on extras. See "
2913                           "<https://github.com/trentm/python-markdown2/wiki/Extras>")
2914    parser.add_option("--link-patterns-file",
2915                      help="path to a link pattern file")
2916    parser.add_option("--self-test", action="store_true",
2917                      help="run internal self-tests (some doctests)")
2918    parser.add_option("--compare", action="store_true",
2919                      help="run against Markdown.pl as well (for testing)")
2920    parser.set_defaults(log_level=logging.INFO, compare=False,
2921                        encoding="utf-8", safe_mode=None, use_file_vars=False)
2922    opts, paths = parser.parse_args()
2923    log.setLevel(opts.log_level)
2924
2925    if opts.self_test:
2926        return _test()
2927
2928    if opts.extras:
2929        extras = {}
2930        for s in opts.extras:
2931            splitter = re.compile("[,;: ]+")
2932            for e in splitter.split(s):
2933                if '=' in e:
2934                    ename, earg = e.split('=', 1)
2935                    try:
2936                        earg = int(earg)
2937                    except ValueError:
2938                        pass
2939                else:
2940                    ename, earg = e, None
2941                extras[ename] = earg
2942    else:
2943        extras = None
2944
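The loop above turns repeated `-x`/`--extras` options into a single dict: names are split on commas, semicolons, colons, or spaces, and a `name=value` form stores an (integer if possible) argument. Extracted as a standalone sketch (`parse_extras_args` is an illustrative name, not a function in the module):

```python
import re

def parse_extras_args(args):
    """Parse CLI extras strings, e.g. ["footnotes, code-friendly", "toc=3"]."""
    extras = {}
    splitter = re.compile("[,;: ]+")
    for s in args:
        for e in splitter.split(s):
            if '=' in e:
                ename, earg = e.split('=', 1)
                try:
                    earg = int(earg)   # numeric args become ints
                except ValueError:
                    pass               # otherwise keep the raw string
            else:
                ename, earg = e, None  # extra without an argument
            extras[ename] = earg
    return extras
```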
2945    if opts.link_patterns_file:
2946        link_patterns = []
2947        f = open(opts.link_patterns_file)
2948        try:
2949            for i, line in enumerate(f.readlines()):
2950                if not line.strip(): continue
2951                if line.lstrip().startswith("#"): continue
2952                try:
2953                    pat, href = line.rstrip().rsplit(None, 1)
2954                except ValueError:
2955                    raise MarkdownError("%s:%d: invalid link pattern line: %r"
2956                                        % (opts.link_patterns_file, i + 1, line))
2957                link_patterns.append(
2958                    (_regex_from_encoded_pattern(pat), href))
2959        finally:
2960            f.close()
2961    else:
2962        link_patterns = None
2963
2964    from os.path import join, dirname, abspath, exists
2965    markdown_pl = join(dirname(dirname(abspath(__file__))), "test",
2966                       "Markdown.pl")
2967    if not paths:
2968        paths = ['-']
2969    for path in paths:
2970        if path == '-':
2971            text = sys.stdin.read()
2972        else:
2973            fp = codecs.open(path, 'r', opts.encoding)
2974            text = fp.read()
2975            fp.close()
2976        if opts.compare:
2977            from subprocess import Popen, PIPE
2978            print("==== Markdown.pl ====")
2979            p = Popen('perl %s' % markdown_pl, shell=True, stdin=PIPE, stdout=PIPE, close_fds=True)
2980            p.stdin.write(text.encode('utf-8'))
2981            p.stdin.close()
2982            perl_html = p.stdout.read().decode('utf-8')
2983            sys.stdout.write(perl_html)
2984            print("==== markdown2.py ====")
2985        html = markdown(text,
2986                        html4tags=opts.html4tags,
2987                        safe_mode=opts.safe_mode,
2988                        extras=extras, link_patterns=link_patterns,
2989                        use_file_vars=opts.use_file_vars,
2990                        cli=True)
2991        sys.stdout.write(html)
2992        if extras and "toc" in extras:
2993            log.debug("toc_html: " +
2994                      str(html.toc_html.encode(sys.stdout.encoding or "utf-8", 'xmlcharrefreplace')))
2995        if opts.compare:
2996            test_dir = join(dirname(dirname(abspath(__file__))), "test")
2997            if exists(join(test_dir, "test_markdown2.py")):
2998                sys.path.insert(0, test_dir)
2999                from test_markdown2 import norm_html_from_html
3000                norm_html = norm_html_from_html(html)
3001                norm_perl_html = norm_html_from_html(perl_html)
3002            else:
3003                norm_html = html
3004                norm_perl_html = perl_html
3005            print("==== match? %r ====" % (norm_perl_html == norm_html))
3006
3007
3008if __name__ == "__main__":
3009    sys.exit(main(sys.argv))
class MarkdownError(builtins.Exception):
146class MarkdownError(Exception):
147    pass

Common base class for all non-exit exceptions.

Inherited Members
builtins.Exception
Exception
builtins.BaseException
with_traceback
def markdown_path( path, encoding='utf-8', html4tags=False, tab_width=4, safe_mode=None, extras=None, link_patterns=None, footnote_title=None, footnote_return_symbol=None, use_file_vars=False):
152def markdown_path(path, encoding="utf-8",
153                  html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
154                  safe_mode=None, extras=None, link_patterns=None,
155                  footnote_title=None, footnote_return_symbol=None,
156                  use_file_vars=False):
157    fp = codecs.open(path, 'r', encoding)
158    text = fp.read()
159    fp.close()
160    return Markdown(html4tags=html4tags, tab_width=tab_width,
161                    safe_mode=safe_mode, extras=extras,
162                    link_patterns=link_patterns,
163                    footnote_title=footnote_title,
164                    footnote_return_symbol=footnote_return_symbol,
165                    use_file_vars=use_file_vars).convert(text)
def markdown( text, html4tags=False, tab_width=4, safe_mode=None, extras=None, link_patterns=None, footnote_title=None, footnote_return_symbol=None, use_file_vars=False, cli=False):
168def markdown(text, html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
169             safe_mode=None, extras=None, link_patterns=None,
170             footnote_title=None, footnote_return_symbol=None,
171             use_file_vars=False, cli=False):
172    return Markdown(html4tags=html4tags, tab_width=tab_width,
173                    safe_mode=safe_mode, extras=extras,
174                    link_patterns=link_patterns,
175                    footnote_title=footnote_title,
176                    footnote_return_symbol=footnote_return_symbol,
177                    use_file_vars=use_file_vars, cli=cli).convert(text)
class Markdown:
 180class Markdown(object):
 181    # The dict of "extras" to enable in processing -- a mapping of
 182    # extra name to argument for the extra. Most extras do not have an
 183    # argument, in which case the value is None.
 184    #
 185    # This can be set via (a) subclassing and (b) the constructor
 186    # "extras" argument.
 187    extras = None
 188
 189    urls = None
 190    titles = None
 191    html_blocks = None
 192    html_spans = None
 193    html_removed_text = "{(#HTML#)}"  # placeholder removed text that does not trigger bold
 194    html_removed_text_compat = "[HTML_REMOVED]"  # for compat with markdown.py
 195
 196    _toc = None
 197
 198    # Used to track when we're inside an ordered or unordered list
 199    # (see _ProcessListItems() for details):
 200    list_level = 0
 201
 202    _ws_only_line_re = re.compile(r"^[ \t]+$", re.M)
 203
 204    def __init__(self, html4tags=False, tab_width=4, safe_mode=None,
 205                 extras=None, link_patterns=None,
 206                 footnote_title=None, footnote_return_symbol=None,
 207                 use_file_vars=False, cli=False):
 208        if html4tags:
 209            self.empty_element_suffix = ">"
 210        else:
 211            self.empty_element_suffix = " />"
 212        self.tab_width = tab_width
 213        self.tab = tab_width * " "
 214
 215        # For compatibility with earlier markdown2.py and with
 216        # markdown.py's safe_mode being a boolean,
 217        #   safe_mode == True -> "replace"
 218        if safe_mode is True:
 219            self.safe_mode = "replace"
 220        else:
 221            self.safe_mode = safe_mode
 222
 223        # Massaging and building the "extras" info.
 224        if self.extras is None:
 225            self.extras = {}
 226        elif not isinstance(self.extras, dict):
 227            self.extras = dict([(e, None) for e in self.extras])
 228        if extras:
 229            if not isinstance(extras, dict):
 230                extras = dict([(e, None) for e in extras])
 231            self.extras.update(extras)
 232        assert isinstance(self.extras, dict)
 233
 234        if "toc" in self.extras:
 235            if "header-ids" not in self.extras:
 236                self.extras["header-ids"] = None  # "toc" implies "header-ids"
 237
 238            if self.extras["toc"] is None:
 239                self._toc_depth = 6
 240            else:
 241                self._toc_depth = self.extras["toc"].get("depth", 6)
 242        self._instance_extras = self.extras.copy()
 243
 244        if 'link-patterns' in self.extras:
 245            if link_patterns is None:
 246                # if you have specified that the link-patterns extra SHOULD
 247                # be used (via self.extras) but you haven't provided anything
 248                # via the link_patterns argument then an error is raised
 249                raise MarkdownError("If the 'link-patterns' extra is used, an argument for 'link_patterns' is required")
 250        self.link_patterns = link_patterns
 251        self.footnote_title = footnote_title
 252        self.footnote_return_symbol = footnote_return_symbol
 253        self.use_file_vars = use_file_vars
 254        self._outdent_re = re.compile(r'^(\t|[ ]{1,%d})' % tab_width, re.M)
 255        self.cli = cli
 256
 257        self._escape_table = g_escape_table.copy()
 258        self._code_table = {}
 259        if "smarty-pants" in self.extras:
 260            self._escape_table['"'] = _hash_text('"')
 261            self._escape_table["'"] = _hash_text("'")
 262
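The constructor's "extras" massaging accepts either a dict (extra name to argument) or a plain sequence of names; a sequence is normalized to a dict mapping each name to `None`. Sketched in isolation (`normalize_extras` is a hypothetical helper name for illustration):

```python
def normalize_extras(extras):
    """Normalize an extras spec the way Markdown.__init__ does:
    a list/tuple of names becomes {name: None, ...}; a dict passes through."""
    if extras is None:
        return {}
    if not isinstance(extras, dict):
        return {e: None for e in extras}
    return dict(extras)
```

This is why both `Markdown(extras=["toc"])` and `Markdown(extras={"toc": {"depth": 2}})` work: internally everything is a dict of name to argument.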
 263    def reset(self):
 264        self.urls = {}
 265        self.titles = {}
 266        self.html_blocks = {}
 267        self.html_spans = {}
 268        self.list_level = 0
 269        self.extras = self._instance_extras.copy()
 270        self._setup_extras()
 271        self._toc = None
 272
 273    def _setup_extras(self):
 274        if "footnotes" in self.extras:
 275            self.footnotes = {}
 276            self.footnote_ids = []
 277        if "header-ids" in self.extras:
 278            self._count_from_header_id = defaultdict(int)
 279        if "metadata" in self.extras:
 280            self.metadata = {}
 281
 282    # Per <https://developer.mozilla.org/en-US/docs/HTML/Element/a> "rel"
 283    # should only be used in <a> tags with an "href" attribute.
 284
 285    # Opens the linked document in a new window or tab
 286    # should only be used in <a> tags with an "href" attribute.
 287    # same with _a_nofollow
 288    _a_nofollow_or_blank_links = re.compile(r"""
 289        <(a)
 290        (
 291            [^>]*
 292            href=   # href is required
 293            ['"]?   # HTML5 attribute values do not have to be quoted
 294            [^#'"]  # We don't want to match href values that start with # (like footnotes)
 295        )
 296        """,
 297                                            re.IGNORECASE | re.VERBOSE
 298                                            )
 299
 300    def convert(self, text):
 301        """Convert the given text."""
 302        # Main function. The order in which other subs are called here is
 303        # essential. Link and image substitutions need to happen before
 304        # _EscapeSpecialChars(), so that any *'s or _'s in the <a>
 305        # and <img> tags get encoded.
 306
 307        # Clear the global hashes. If we don't clear these, you get conflicts
 308        # from other articles when generating a page which contains more than
 309        # one article (e.g. an index page that shows the N most recent
 310        # articles):
 311        self.reset()
 312
 313        if not isinstance(text, str):
 314            # TODO: perhaps shouldn't presume UTF-8 for string input?
 315            text = str(text, 'utf-8')
 316
 317        if self.use_file_vars:
 318            # Look for emacs-style file variable hints.
 319            text = self._emacs_oneliner_vars_pat.sub(self._emacs_vars_oneliner_sub, text)
 320            emacs_vars = self._get_emacs_vars(text)
 321            if "markdown-extras" in emacs_vars:
 322                splitter = re.compile("[ ,]+")
 323                for e in splitter.split(emacs_vars["markdown-extras"]):
 324                    if '=' in e:
 325                        ename, earg = e.split('=', 1)
 326                        try:
 327                            earg = int(earg)
 328                        except ValueError:
 329                            pass
 330                    else:
 331                        ename, earg = e, None
 332                    self.extras[ename] = earg
 333
 334            self._setup_extras()
 335
 336        # Standardize line endings:
 337        text = text.replace("\r\n", "\n")
 338        text = text.replace("\r", "\n")
 339
 340        # Make sure $text ends with a couple of newlines:
 341        text += "\n\n"
 342
 343        # Convert all tabs to spaces.
 344        text = self._detab(text)
 345
 346        # Strip any lines consisting only of spaces and tabs.
 347        # This makes subsequent regexen easier to write, because we can
 348        # match consecutive blank lines with /\n+/ instead of something
 349        # contorted like /[ \t]*\n+/ .
 350        text = self._ws_only_line_re.sub("", text)
 351
 352        # strip metadata from head and extract
 353        if "metadata" in self.extras:
 354            text = self._extract_metadata(text)
 355
 356        text = self.preprocess(text)
 357
 358        if "fenced-code-blocks" in self.extras and not self.safe_mode:
 359            text = self._do_fenced_code_blocks(text)
 360
 361        if self.safe_mode:
 362            text = self._hash_html_spans(text)
 363
 364        # Turn block-level HTML blocks into hash entries
 365        text = self._hash_html_blocks(text, raw=True)
 366
 367        if "fenced-code-blocks" in self.extras and self.safe_mode:
 368            text = self._do_fenced_code_blocks(text)
 369
 370        if 'admonitions' in self.extras:
 371            text = self._do_admonitions(text)
 372
 373        # Because numbering references aren't links (yet?), we can do everything associated with counters
 374        # before we get started
 375        if "numbering" in self.extras:
 376            text = self._do_numbering(text)
 377
 378        # Strip link definitions, store in hashes.
 379        if "footnotes" in self.extras:
 380            # Must do footnotes first because an unlucky footnote defn
 381            # looks like a link defn:
 382            #   [^4]: this "looks like a link defn"
 383            text = self._strip_footnote_definitions(text)
 384        text = self._strip_link_definitions(text)
 385
 386        text = self._run_block_gamut(text)
 387
 388        if "footnotes" in self.extras:
 389            text = self._add_footnotes(text)
 390
 391        text = self.postprocess(text)
 392
 393        text = self._unescape_special_chars(text)
 394
 395        if self.safe_mode:
 396            text = self._unhash_html_spans(text)
 397            # return the removed text warning to its markdown.py compatible form
 398            text = text.replace(self.html_removed_text, self.html_removed_text_compat)
 399
 400        do_target_blank_links = "target-blank-links" in self.extras
 401        do_nofollow_links = "nofollow" in self.extras
 402
 403        if do_target_blank_links and do_nofollow_links:
 404            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow noopener" target="_blank"\2', text)
 405        elif do_target_blank_links:
 406            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="noopener" target="_blank"\2', text)
 407        elif do_nofollow_links:
 408            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow"\2', text)
 409
 410        if "toc" in self.extras and self._toc:
 411            self._toc_html = calculate_toc_html(self._toc)
 412
 413            # Prepend toc html to output
 414            if self.cli:
 415                text = '{}\n{}'.format(self._toc_html, text)
 416
 417        text += "\n"
 418
 419        # Attach attrs to output
 420        rv = UnicodeWithAttrs(text)
 421
 422        if "toc" in self.extras and self._toc:
 423            rv.toc_html = self._toc_html
 424
 425        if "metadata" in self.extras:
 426            rv.metadata = self.metadata
 427        return rv
 428
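`convert()` returns a `UnicodeWithAttrs` instance: a `str` subclass that can carry extras-produced data such as `toc_html` and `metadata` alongside the rendered HTML. A minimal sketch of the idea (not the module's exact class):

```python
class UnicodeWithAttrs(str):
    """A plain string that can also carry attributes set by extras,
    e.g. `toc_html` (from the "toc" extra) and `metadata`."""
    metadata = None
    toc_html = None

rv = UnicodeWithAttrs("<p>hi</p>\n")
rv.toc_html = "<ul><li>hi</li></ul>"
```

Because it is still a `str`, callers can write the result directly or use it anywhere HTML text is expected, and only code that cares about the extras needs to look at the attributes.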
 429    def postprocess(self, text):
 430        """A hook for subclasses to do some postprocessing of the html, if
 431        desired. This is called before unescaping of special chars and
 432        unhashing of raw HTML spans.
 433        """
 434        return text
 435
 436    def preprocess(self, text):
 437        """A hook for subclasses to do some preprocessing of the Markdown, if
 438        desired. This is called after basic formatting of the text, but prior
 439        to any extras, safe mode, etc. processing.
 440        """
 441        return text
 442
 443    # Content is treated as metadata if it starts with optional '---'-fenced `key: value`
 444    # pairs. E.g. (indented for presentation):
 445    #   ---
 446    #   foo: bar
 447    #   another-var: blah blah
 448    #   ---
 449    #   # header
 450    # or:
 451    #   foo: bar
 452    #   another-var: blah blah
 453    #
 454    #   # header
 455    _meta_data_pattern = re.compile(r'''
 456        ^(?:---[\ \t]*\n)?(  # optional opening fence
 457            (?:
 458                [\S \t]*\w[\S \t]*\s*:(?:\n+[ \t]+.*)+  # indented lists
 459            )|(?:
 460                (?:[\S \t]*\w[\S \t]*\s*:\s+>(?:\n\s+.*)+?)  # multiline long descriptions
 461                (?=\n[\S \t]*\w[\S \t]*\s*:\s*.*\n|\s*\Z)  # match up until the start of the next key:value definition or the end of the input text
 462            )|(?:
 463                [\S \t]*\w[\S \t]*\s*:(?! >).*\n?  # simple key:value pair, leading spaces allowed
 464            )
 465        )(?:---[\ \t]*\n)?  # optional closing fence
 466        ''', re.MULTILINE | re.VERBOSE
 467                                    )
 468
 469    _key_val_list_pat = re.compile(
 470        r"^-(?:[ \t]*([^\n]*)(?:[ \t]*[:-][ \t]*(\S+))?)(?:\n((?:[ \t]+[^\n]+\n?)+))?",
 471        re.MULTILINE,
 472    )
 473    _key_val_dict_pat = re.compile(
 474        r"^([^:\n]+)[ \t]*:[ \t]*([^\n]*)(?:((?:\n[ \t]+[^\n]+)+))?", re.MULTILINE
 475    )  # grp0: key, grp1: value, grp2: multiline value
 476    _meta_data_fence_pattern = re.compile(r'^---[\ \t]*\n', re.MULTILINE)
 477    _meta_data_newline = re.compile("^\n", re.MULTILINE)
 478
 479    def _extract_metadata(self, text):
 480        if text.startswith("---"):
 481            fence_splits = re.split(self._meta_data_fence_pattern, text, maxsplit=2)
 482            metadata_content = fence_splits[1]
 483            match = re.findall(self._meta_data_pattern, metadata_content)
 484            if not match:
 485                return text
 486            tail = fence_splits[2]
 487        else:
 488            metadata_split = re.split(self._meta_data_newline, text, maxsplit=1)
 489            metadata_content = metadata_split[0]
 490            match = re.findall(self._meta_data_pattern, metadata_content)
 491            if not match:
 492                return text
 493            tail = metadata_split[1]
 494
 495        def parse_structured_value(value):
 496            vs = value.lstrip()
 497            vs = value.replace(v[: len(value) - len(vs)], "\n")[1:]
 498
 499            # List
 500            if vs.startswith("-"):
 501                r = []
 502                for match in re.findall(self._key_val_list_pat, vs):
 503                    if match[0] and not match[1] and not match[2]:
 504                        r.append(match[0].strip())
 505                    elif match[0] == ">" and not match[1] and match[2]:
 506                        r.append(match[2].strip())
 507                    elif match[0] and match[1]:
 508                        r.append({match[0].strip(): match[1].strip()})
 509                    elif not match[0] and not match[1] and match[2]:
 510                        r.append(parse_structured_value(match[2]))
 511                    else:
 512                        # Broken case
 513                        pass
 514
 515                return r
 516
 517            # Dict
 518            else:
 519                return {
 520                    match[0].strip(): (
 521                        match[1].strip()
 522                        if match[1]
 523                        else parse_structured_value(match[2])
 524                    )
 525                    for match in re.findall(self._key_val_dict_pat, vs)
 526                }
 527
 528        for item in match:
 529
 530            k, v = item.split(":", 1)
 531
 532            # Multiline value
 533            if v[:3] == " >\n":
 534                self.metadata[k.strip()] = _dedent(v[3:]).strip()
 535
 536            # Empty value
 537            elif v == "\n":
 538                self.metadata[k.strip()] = ""
 539
 540            # Structured value
 541            elif v[0] == "\n":
 542                self.metadata[k.strip()] = parse_structured_value(v)
 543
 544            # Simple value
 545            else:
 546                self.metadata[k.strip()] = v.strip()
 547
 548        return tail
 549
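For the common case, `_extract_metadata` splits off a `---`-fenced front-matter block and records its `key: value` pairs, returning the remaining text. A deliberately simplified sketch handling only simple values (the real code above also parses multiline and structured values):

```python
import re

# Matches a '---' fence line, as in the module's _meta_data_fence_pattern.
_FENCE = re.compile(r'^---[ \t]*\n', re.MULTILINE)

def extract_simple_metadata(text):
    """Return (metadata_dict, remaining_text) for '---'-fenced front matter."""
    if not text.startswith("---"):
        return {}, text
    _, body, tail = re.split(_FENCE, text, maxsplit=2)
    metadata = {}
    for line in body.splitlines():
        if ':' in line:
            k, v = line.split(':', 1)
            metadata[k.strip()] = v.strip()
    return metadata, tail
```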
 550    _emacs_oneliner_vars_pat = re.compile(r"((?:<!--)?\s*-\*-)\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?(-\*-\s*(?:-->)?)",
 551                                          re.UNICODE)
 552    # This regular expression is intended to match blocks like this:
 553    #    PREFIX Local Variables: SUFFIX
 554    #    PREFIX mode: Tcl SUFFIX
 555    #    PREFIX End: SUFFIX
 556    # Some notes:
 557    # - "[ \t]" is used instead of "\s" to specifically exclude newlines
 558    # - "(\r\n|\n|\r)" is used instead of "$" because the sre engine does
 559    #   not like anything other than Unix-style line terminators.
 560    _emacs_local_vars_pat = re.compile(r"""^
 561        (?P<prefix>(?:[^\r\n|\n|\r])*?)
 562        [\ \t]*Local\ Variables:[\ \t]*
 563        (?P<suffix>.*?)(?:\r\n|\n|\r)
 564        (?P<content>.*?\1End:)
 565        """, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)
 566
 567    def _emacs_vars_oneliner_sub(self, match):
 568        if match.group(1).strip() == '-*-' and match.group(4).strip() == '-*-':
 569            lead_ws = re.findall(r'^\s*', match.group(1))[0]
 570            tail_ws = re.findall(r'\s*$', match.group(4))[0]
 571            return '%s<!-- %s %s %s -->%s' % (lead_ws, '-*-', match.group(2).strip(), '-*-', tail_ws)
 572
 573        start, end = match.span()
 574        return match.string[start: end]
 575
 576    def _get_emacs_vars(self, text):
 577        """Return a dictionary of emacs-style local variables.
 578
 579        Parsing is done loosely according to this spec (and according to
 580        some in-practice deviations from this):
 581        http://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html#Specifying-File-Variables
 582        """
 583        emacs_vars = {}
 584        SIZE = pow(2, 13)  # 8kB
 585
 586        # Search near the start for a '-*-'-style one-liner of variables.
 587        head = text[:SIZE]
 588        if "-*-" in head:
 589            match = self._emacs_oneliner_vars_pat.search(head)
 590            if match:
 591                emacs_vars_str = match.group(2)
 592                assert '\n' not in emacs_vars_str
 593                emacs_var_strs = [s.strip() for s in emacs_vars_str.split(';')
 594                                  if s.strip()]
 595                if len(emacs_var_strs) == 1 and ':' not in emacs_var_strs[0]:
 596                    # While not in the spec, this form is allowed by emacs:
 597                    #   -*- Tcl -*-
 598                    # where the implied "variable" is "mode". This form
 599                    # is only allowed if there are no other variables.
 600                    emacs_vars["mode"] = emacs_var_strs[0].strip()
 601                else:
 602                    for emacs_var_str in emacs_var_strs:
 603                        try:
 604                            variable, value = emacs_var_str.strip().split(':', 1)
 605                        except ValueError:
 606                            log.debug("emacs variables error: malformed -*- "
 607                                      "line: %r", emacs_var_str)
 608                            continue
 609                        # Lowercase the variable name because Emacs allows "Mode"
 610                        # or "mode" or "MoDe", etc.
 611                        emacs_vars[variable.lower()] = value.strip()
 612
 613        tail = text[-SIZE:]
 614        if "Local Variables" in tail:
 615            match = self._emacs_local_vars_pat.search(tail)
 616            if match:
 617                prefix = match.group("prefix")
 618                suffix = match.group("suffix")
 619                lines = match.group("content").splitlines(0)
 620                # print "prefix=%r, suffix=%r, content=%r, lines: %s"\
 621                #      % (prefix, suffix, match.group("content"), lines)
 622
 623                # Validate the Local Variables block: proper prefix and suffix
 624                # usage.
 625                for i, line in enumerate(lines):
 626                    if not line.startswith(prefix):
 627                        log.debug("emacs variables error: line '%s' "
 628                                  "does not use proper prefix '%s'"
 629                                  % (line, prefix))
 630                        return {}
 631                    # Don't validate suffix on last line. Emacs doesn't care,
 632                    # neither should we.
 633                    if i != len(lines) - 1 and not line.endswith(suffix):
 634                        log.debug("emacs variables error: line '%s' "
 635                                  "does not use proper suffix '%s'"
 636                                  % (line, suffix))
 637                        return {}
 638
 639                # Parse out one emacs var per line.
 640                continued_for = None
 641                for line in lines[:-1]:  # no var on the last line ("PREFIX End:")
 642                    if prefix: line = line[len(prefix):]  # strip prefix
 643                    if suffix: line = line[:-len(suffix)]  # strip suffix
 644                    line = line.strip()
 645                    if continued_for:
 646                        variable = continued_for
 647                        if line.endswith('\\'):
 648                            line = line[:-1].rstrip()
 649                        else:
 650                            continued_for = None
 651                        emacs_vars[variable] += ' ' + line
 652                    else:
 653                        try:
 654                            variable, value = line.split(':', 1)
 655                        except ValueError:
 656                            log.debug("local variables error: missing colon "
 657                                      "in local variables entry: '%s'" % line)
 658                            continue
 659                        # Do NOT lowercase the variable name, because Emacs only
 660                        # allows "mode" (and not "Mode", "MoDe", etc.) in this block.
 661                        value = value.strip()
 662                        if value.endswith('\\'):
 663                            value = value[:-1].rstrip()
 664                            continued_for = variable
 665                        else:
 666                            continued_for = None
 667                        emacs_vars[variable] = value
 668
 669        # Unquote values.
 670        for var, val in list(emacs_vars.items()):
 671            if len(val) > 1 and (val.startswith('"') and val.endswith('"')
 672                                 or val.startswith("'") and val.endswith("'")):
 673                emacs_vars[var] = val[1:-1]
 674
 675        return emacs_vars
 676
 677    def _detab_line(self, line):
 678        r"""Recursively convert tabs to spaces in a single line.
 679
 680        Called from _detab()."""
 681        if '\t' not in line:
 682            return line
 683        chunk1, chunk2 = line.split('\t', 1)
 684        chunk1 += (' ' * (self.tab_width - len(chunk1) % self.tab_width))
 685        output = chunk1 + chunk2
 686        return self._detab_line(output)
 687
 688    def _detab(self, text):
 689        r"""Iterate text line by line and convert tabs to spaces.
 690
 691            >>> m = Markdown()
 692            >>> m._detab("\tfoo")
 693            '    foo'
 694            >>> m._detab("  \tfoo")
 695            '    foo'
 696            >>> m._detab("\t  foo")
 697            '      foo'
 698            >>> m._detab("  foo")
 699            '  foo'
 700            >>> m._detab("  foo\n\tbar\tblam")
 701            '  foo\n    bar blam'
 702        """
 703        if '\t' not in text:
 704            return text
 705        output = []
 706        for line in text.splitlines():
 707            output.append(self._detab_line(line))
 708        return '\n'.join(output)
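The tab-stop arithmetic in `_detab_line` can be shown standalone. A minimal sketch, assuming a default tab width of 4; the helper name `expand_tabs` is illustrative and not part of markdown2:

```python
# Illustrative re-sketch of _detab_line's logic (iterative rather than
# recursive): each tab is replaced by just enough spaces to reach the
# next multiple of tab_width. Not part of markdown2 itself.
def expand_tabs(line, tab_width=4):
    while '\t' in line:
        chunk1, chunk2 = line.split('\t', 1)
        chunk1 += ' ' * (tab_width - len(chunk1) % tab_width)
        line = chunk1 + chunk2
    return line
```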
 709
 710    # I broke out the html5 tags here and added them to _block_tags_a and
 711    # _block_tags_b.  This way html5 tags are easy to keep track of.
 712    _html5tags = '|article|aside|header|hgroup|footer|nav|section|figure|figcaption'
 713
 714    _block_tags_a = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del'
 715    _block_tags_a += _html5tags
 716
 717    _strict_tag_block_re = re.compile(r"""
 718        (                       # save in \1
 719            ^                   # start of line  (with re.M)
 720            <(%s)               # start tag = \2
 721            \b                  # word break
 722            (.*\n)*?            # any number of lines, minimally matching
 723            </\2>               # the matching end tag
 724            [ \t]*              # trailing spaces/tabs
 725            (?=\n+|\Z)          # followed by a newline or end of document
 726        )
 727        """ % _block_tags_a,
 728                                      re.X | re.M)
 729
 730    _block_tags_b = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math'
 731    _block_tags_b += _html5tags
 732
 733    _liberal_tag_block_re = re.compile(r"""
 734        (                       # save in \1
 735            ^                   # start of line  (with re.M)
 736            <(%s)               # start tag = \2
 737            \b                  # word break
 738            (.*\n)*?            # any number of lines, minimally matching
 739            .*</\2>             # the matching end tag
 740            [ \t]*              # trailing spaces/tabs
 741            (?=\n+|\Z)          # followed by a newline or end of document
 742        )
 743        """ % _block_tags_b,
 744                                       re.X | re.M)
 745
 746    _html_markdown_attr_re = re.compile(
 747        r'''\s+markdown=("1"|'1')''')
 748
 749    def _hash_html_block_sub(self, match, raw=False):
 750        html = match.group(1)
 751        if raw and self.safe_mode:
 752            html = self._sanitize_html(html)
 753        elif 'markdown-in-html' in self.extras and 'markdown=' in html:
 754            first_line = html.split('\n', 1)[0]
 755            m = self._html_markdown_attr_re.search(first_line)
 756            if m:
 757                lines = html.split('\n')
 758                middle = '\n'.join(lines[1:-1])
 759                last_line = lines[-1]
 760                first_line = first_line[:m.start()] + first_line[m.end():]
 761                f_key = _hash_text(first_line)
 762                self.html_blocks[f_key] = first_line
 763                l_key = _hash_text(last_line)
 764                self.html_blocks[l_key] = last_line
 765                return ''.join(["\n\n", f_key,
 766                                "\n\n", middle, "\n\n",
 767                                l_key, "\n\n"])
 768        key = _hash_text(html)
 769        self.html_blocks[key] = html
 770        return "\n\n" + key + "\n\n"
 771
 772    def _hash_html_blocks(self, text, raw=False):
 773        """Hashify HTML blocks
 774
 775        We only want to do this for block-level HTML tags, such as headers,
 776        lists, and tables. That's because we still want to wrap <p>s around
 777        "paragraphs" that are wrapped in non-block-level tags, such as anchors,
 778        phrase emphasis, and spans. The list of tags we're looking for is
 779        hard-coded.
 780
 781        @param raw {boolean} indicates if these are raw HTML blocks in
 782            the original source. It makes a difference in "safe" mode.
 783        """
 784        if '<' not in text:
 785            return text
 786
 787        # Pass `raw` value into our calls to self._hash_html_block_sub.
 788        hash_html_block_sub = _curry(self._hash_html_block_sub, raw=raw)
 789
 790        # First, look for nested blocks, e.g.:
 791        #   <div>
 792        #       <div>
 793        #       tags for inner block must be indented.
 794        #       </div>
 795        #   </div>
 796        #
 797        # The outermost tags must start at the left margin for this to match, and
 798        # the inner nested divs must be indented.
 799        # We need to do this before the next, more liberal match, because the next
 800        # match will start at the first `<div>` and stop at the first `</div>`.
 801        text = self._strict_tag_block_re.sub(hash_html_block_sub, text)
 802
 803        # Now match more liberally, simply from `\n<tag>` to `</tag>\n`
 804        text = self._liberal_tag_block_re.sub(hash_html_block_sub, text)
 805
 806        # Special case just for <hr />. It was easier to make a special
 807        # case than to make the other regex more complicated.
 808        if "<hr" in text:
 809            _hr_tag_re = _hr_tag_re_from_tab_width(self.tab_width)
 810            text = _hr_tag_re.sub(hash_html_block_sub, text)
 811
 812        # Special case for standalone HTML comments:
 813        if "<!--" in text:
 814            start = 0
 815            while True:
 816                # Delimiters for next comment block.
 817                try:
 818                    start_idx = text.index("<!--", start)
 819                except ValueError:
 820                    break
 821                try:
 822                    end_idx = text.index("-->", start_idx) + 3
 823                except ValueError:
 824                    break
 825
 826                # Start position for next comment block search.
 827                start = end_idx
 828
 829                # Validate whitespace before comment.
 830                if start_idx:
 831                    # - Up to `tab_width - 1` spaces before start_idx.
 832                    for i in range(self.tab_width - 1):
 833                        if text[start_idx - 1] != ' ':
 834                            break
 835                        start_idx -= 1
 836                        if start_idx == 0:
 837                            break
 838                    # - Must be preceded by 2 newlines or hit the start of
 839                    #   the document.
 840                    if start_idx == 0:
 841                        pass
 842                    elif start_idx == 1 and text[0] == '\n':
 843                        start_idx = 0  # to match a minute detail of the Markdown.pl regex
 844                    elif text[start_idx - 2:start_idx] == '\n\n':
 845                        pass
 846                    else:
 847                        break
 848
 849                # Validate whitespace after comment.
 850                # - Any number of spaces and tabs.
 851                while end_idx < len(text):
 852                    if text[end_idx] not in ' \t':
 853                        break
 854                    end_idx += 1
 855                # - Must be followed by 2 newlines or hit end of text.
 856                if text[end_idx:end_idx + 2] not in ('', '\n', '\n\n'):
 857                    continue
 858
 859                # Escape and hash (must match `_hash_html_block_sub`).
 860                html = text[start_idx:end_idx]
 861                if raw and self.safe_mode:
 862                    html = self._sanitize_html(html)
 863                key = _hash_text(html)
 864                self.html_blocks[key] = html
 865                text = text[:start_idx] + "\n\n" + key + "\n\n" + text[end_idx:]
 866
 867        if "xml" in self.extras:
 868            # Treat XML processing instructions and namespaced one-liner
 869            # tags as if they were block HTML tags. E.g., if standalone
 870            # (i.e. are their own paragraph), the following do not get
 871            # wrapped in a <p> tag:
 872            #    <?foo bar?>
 873            #
 874            #    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chapter_1.md"/>
 875            _xml_oneliner_re = _xml_oneliner_re_from_tab_width(self.tab_width)
 876            text = _xml_oneliner_re.sub(hash_html_block_sub, text)
 877
 878        return text
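The hash-and-restore technique in `_hash_html_blocks` reduces to a small pattern: swap protected HTML for an opaque placeholder key so later passes cannot touch it, then swap the original back at the end. A hedged sketch only; `protect`/`restore` are illustrative names, and the module's real `_hash_text` differs in detail:

```python
# Minimal sketch of the "hashify" idea used above, not markdown2's
# exact implementation.
import hashlib

def hash_text(s):
    # An MD5-based placeholder key, in the same spirit as the module's
    # _hash_text helper.
    return 'md5-' + hashlib.md5(s.encode('utf-8')).hexdigest()

blocks = {}

def protect(html):
    # Replace a block of HTML with an opaque key and remember it.
    key = hash_text(html)
    blocks[key] = html
    return key

def restore(text):
    # Swap every remembered key back for its original HTML.
    for key, html in blocks.items():
        text = text.replace(key, html)
    return text
```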
 879
 880    def _strip_link_definitions(self, text):
 881        # Strips link definitions from text, stores the URLs and titles in
 882        # hash references.
 883        less_than_tab = self.tab_width - 1
 884
 885        # Link defs are in the form:
 886        #   [id]: url "optional title"
 887        _link_def_re = re.compile(r"""
 888            ^[ ]{0,%d}\[(.+)\]: # id = \1
 889              [ \t]*
 890              \n?               # maybe *one* newline
 891              [ \t]*
 892            <?(.+?)>?           # url = \2
 893              [ \t]*
 894            (?:
 895                \n?             # maybe one newline
 896                [ \t]*
 897                (?<=\s)         # lookbehind for whitespace
 898                ['"(]
 899                ([^\n]*)        # title = \3
 900                ['")]
 901                [ \t]*
 902            )?  # title is optional
 903            (?:\n+|\Z)
 904            """ % less_than_tab, re.X | re.M | re.U)
 905        return _link_def_re.sub(self._extract_link_def_sub, text)
 906
 907    def _extract_link_def_sub(self, match):
 908        id, url, title = match.groups()
 909        key = id.lower()  # Link IDs are case-insensitive
 910        self.urls[key] = self._encode_amps_and_angles(url)
 911        if title:
 912            self.titles[key] = title
 913        return ""
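The link-definition stripping above can be sketched with a much simpler regex. This hedged reduction handles single-line `[id]: url "title"` definitions only (the real `_link_def_re` also allows a wrapped line); the names `simple_link_def_re` and `strip_link_defs` are illustrative:

```python
import re

# Simplified stand-in for _link_def_re: id, optional <>-wrapped url,
# optional quoted/parenthesized title, all on one line.
simple_link_def_re = re.compile(
    r"""^\[([^\]]+)\]:\s*<?(\S+?)>?(?:\s+["'(]([^\n]*)["')])?\s*$""", re.M)

def strip_link_defs(text):
    # Collect definitions into dicts and remove them from the text.
    urls, titles = {}, {}
    def sub(match):
        key = match.group(1).lower()  # link ids are case-insensitive
        urls[key] = match.group(2)
        if match.group(3):
            titles[key] = match.group(3)
        return ''
    return simple_link_def_re.sub(sub, text), urls, titles
```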
 914
 915    def _do_numbering(self, text):
 916        ''' Handle the special extension for generic numbering of
 917            tables, figures, etc.
 918        '''
 919        # First pass to define all the references
 920        self.regex_defns = re.compile(r'''
 921            \[\#(\w+) # the counter.  Open square plus hash plus a word \1
 922            ([^@]*)   # Some optional characters, that aren't an @. \2
 923            @(\w+)       # the id.  Should this be normed? \3
 924            ([^\]]*)\]   # The rest of the text up to the terminating ] \4
 925            ''', re.VERBOSE)
 926        self.regex_subs = re.compile(r"\[@(\w+)\s*\]")  # [@ref_id]
 927        counters = {}
 928        references = {}
 929        replacements = []
 930        definition_html = '<figcaption class="{}" id="counter-ref-{}">{}{}{}</figcaption>'
 931        reference_html = '<a class="{}" href="#counter-ref-{}">{}</a>'
 932        for match in self.regex_defns.finditer(text):
 933            # We must have four match groups otherwise this isn't a numbering reference
 934            if len(match.groups()) != 4:
 935                continue
 936            counter = match.group(1)
 937            text_before = match.group(2).strip()
 938            ref_id = match.group(3)
 939            text_after = match.group(4)
 940            number = counters.get(counter, 1)
 941            references[ref_id] = (number, counter)
 942            replacements.append((match.start(0),
 943                                 definition_html.format(counter,
 944                                                        ref_id,
 945                                                        text_before,
 946                                                        number,
 947                                                        text_after),
 948                                 match.end(0)))
 949            counters[counter] = number + 1
 950        for repl in reversed(replacements):
 951            text = text[:repl[0]] + repl[1] + text[repl[2]:]
 952
 953        # Second pass to replace the references with the right
 954        # value of the counter
 955        # Fwiw, it's vaguely annoying to have to turn the iterator into
 956        # a list and then reverse it but I can't think of a better thing to do.
 957        for match in reversed(list(self.regex_subs.finditer(text))):
 958            number, counter = references.get(match.group(1), (None, None))
 959            if number is not None:
 960                repl = reference_html.format(counter,
 961                                             match.group(1),
 962                                             number)
 963            else:
 964                repl = reference_html.format(match.group(1),
 965                                             'countererror',
 966                                             '?' + match.group(1) + '?')
 967            if "smarty-pants" in self.extras:
 968                repl = repl.replace('"', self._escape_table['"'])
 969
 970            text = text[:match.start()] + repl + text[match.end():]
 971        return text
 972
 973    def _extract_footnote_def_sub(self, match):
 974        id, text = match.groups()
 975        text = _dedent(text, skip_first_line=not text.startswith('\n')).strip()
 976        normed_id = re.sub(r'\W', '-', id)
 977        # Ensure footnote text ends with a couple newlines (for some
 978        # block gamut matches).
 979        self.footnotes[normed_id] = text + "\n\n"
 980        return ""
 981
 982    def _strip_footnote_definitions(self, text):
 983        """A footnote definition looks like this:
 984
 985            [^note-id]: Text of the note.
 986
 987                May include one or more indented paragraphs.
 988
 989        Where,
 990        - The 'note-id' can be pretty much anything, though typically it
 991          is the number of the footnote.
 992        - The first paragraph may start on the next line, like so:
 993
 994            [^note-id]:
 995                Text of the note.
 996        """
 997        less_than_tab = self.tab_width - 1
 998        footnote_def_re = re.compile(r'''
 999            ^[ ]{0,%d}\[\^(.+)\]:   # id = \1
1000            [ \t]*
1001            (                       # footnote text = \2
1002              # First line need not start with the spaces.
1003              (?:\s*.*\n+)
1004              (?:
1005                (?:[ ]{%d} | \t)  # Subsequent lines must be indented.
1006                .*\n+
1007              )*
1008            )
1009            # Lookahead for non-space at line-start, or end of doc.
1010            (?:(?=^[ ]{0,%d}\S)|\Z)
1011            ''' % (less_than_tab, self.tab_width, self.tab_width),
1012                                     re.X | re.M)
1013        return footnote_def_re.sub(self._extract_footnote_def_sub, text)
1014
1015    _hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)
1016
1017    def _run_block_gamut(self, text):
1018        # These are all the transformations that form block-level
1019        # tags like paragraphs, headers, and list items.
1020
1021        if 'admonitions' in self.extras:
1022            text = self._do_admonitions(text)
1023
1024        if "fenced-code-blocks" in self.extras:
1025            text = self._do_fenced_code_blocks(text)
1026
1027        text = self._do_headers(text)
1028
1029        # Do Horizontal Rules:
1030        # On the number of spaces in horizontal rules: The spec is fuzzy: "If
1031        # you wish, you may use spaces between the hyphens or asterisks."
1032        # Markdown.pl 1.0.1's hr regexes limit the number of spaces between the
1033        # hr chars to one or two. We'll reproduce that limit here.
1034        hr = "\n<hr" + self.empty_element_suffix + "\n"
1035        text = re.sub(self._hr_re, hr, text)
1036
1037        text = self._do_lists(text)
1038
1039        if "pyshell" in self.extras:
1040            text = self._prepare_pyshell_blocks(text)
1041        if "wiki-tables" in self.extras:
1042            text = self._do_wiki_tables(text)
1043        if "tables" in self.extras:
1044            text = self._do_tables(text)
1045
1046        text = self._do_code_blocks(text)
1047
1048        text = self._do_block_quotes(text)
1049
1050        # We already ran _HashHTMLBlocks() before, in Markdown(), but that
1051        # was to escape raw HTML in the original Markdown source. This time,
1052        # we're escaping the markup we've just created, so that we don't wrap
1053        # <p> tags around block-level tags.
1054        text = self._hash_html_blocks(text)
1055
1056        text = self._form_paragraphs(text)
1057
1058        return text
1059
1060    def _pyshell_block_sub(self, match):
1061        if "fenced-code-blocks" in self.extras:
1062            dedented = _dedent(match.group(0))
1063            return self._do_fenced_code_blocks("```pycon\n" + dedented + "```\n")
1064        lines = match.group(0).splitlines(0)
1065        _dedentlines(lines)
1066        indent = ' ' * self.tab_width
1067        s = ('\n'  # separate from possible cuddled paragraph
1068             + indent + ('\n' + indent).join(lines)
1069             + '\n')
1070        return s
1071
1072    def _prepare_pyshell_blocks(self, text):
1073        """Ensure that Python interactive shell sessions are put in
1074        code blocks -- even if not properly indented.
1075        """
1076        if ">>>" not in text:
1077            return text
1078
1079        less_than_tab = self.tab_width - 1
1080        _pyshell_block_re = re.compile(r"""
1081            ^([ ]{0,%d})>>>[ ].*\n  # first line
1082            ^(\1[^\S\n]*\S.*\n)*    # any number of subsequent lines with at least one character
1083            (?=^\1?\n|\Z)           # ends with a blank line or end of document
1084            """ % less_than_tab, re.M | re.X)
1085
1086        return _pyshell_block_re.sub(self._pyshell_block_sub, text)
1087
1088    def _table_sub(self, match):
1089        trim_space_re = '^[ \t\n]+|[ \t\n]+$'
1090        trim_bar_re = r'^\||\|$'
1091        split_bar_re = r'^\||(?<![\`\\])\|'
1092        escape_bar_re = r'\\\|'
1093
1094        head, underline, body = match.groups()
1095
1096        # Determine aligns for columns.
1097        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
1098                re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", underline)))]
1099        align_from_col_idx = {}
1100        for col_idx, col in enumerate(cols):
1101            if col[0] == ':' and col[-1] == ':':
1102                align_from_col_idx[col_idx] = ' style="text-align:center;"'
1103            elif col[0] == ':':
1104                align_from_col_idx[col_idx] = ' style="text-align:left;"'
1105            elif col[-1] == ':':
1106                align_from_col_idx[col_idx] = ' style="text-align:right;"'
1107
1108        # thead
1109        hlines = ['<table%s>' % self._html_class_str_from_tag('table'), '<thead>', '<tr>']
1110        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
1111                re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", head)))]
1112        for col_idx, col in enumerate(cols):
1113            hlines.append('  <th%s>%s</th>' % (
1114                align_from_col_idx.get(col_idx, ''),
1115                self._run_span_gamut(col)
1116            ))
1117        hlines.append('</tr>')
1118        hlines.append('</thead>')
1119
1120        # tbody
1121        hlines.append('<tbody>')
1122        for line in body.strip('\n').split('\n'):
1123            hlines.append('<tr>')
1124            cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
1125                    re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", line)))]
1126            for col_idx, col in enumerate(cols):
1127                hlines.append('  <td%s>%s</td>' % (
1128                    align_from_col_idx.get(col_idx, ''),
1129                    self._run_span_gamut(col)
1130                ))
1131            hlines.append('</tr>')
1132        hlines.append('</tbody>')
1133        hlines.append('</table>')
1134
1135        return '\n'.join(hlines) + '\n'
1136
1137    def _do_tables(self, text):
1138        """Copying PHP-Markdown and GFM table syntax. Some regex borrowed from
1139        https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L2538
1140        """
1141        less_than_tab = self.tab_width - 1
1142        table_re = re.compile(r'''
1143                (?:(?<=\n\n)|\A\n?)             # leading blank line
1144
1145                ^[ ]{0,%d}                      # allowed whitespace
1146                (.*[|].*)  \n                   # $1: header row (at least one pipe)
1147
1148                ^[ ]{0,%d}                      # allowed whitespace
1149                (                               # $2: underline row
1150                    # underline row with leading bar
1151                    (?:  \|\ *:?-+:?\ *  )+  \|? \s? \n
1152                    |
1153                    # or, underline row without leading bar
1154                    (?:  \ *:?-+:?\ *\|  )+  (?:  \ *:?-+:?\ *  )? \s? \n
1155                )
1156
1157                (                               # $3: data rows
1158                    (?:
1159                        ^[ ]{0,%d}(?!\ )         # ensure line begins with 0 to less_than_tab spaces
1160                        .*\|.*  \n
1161                    )+
1162                )
1163            ''' % (less_than_tab, less_than_tab, less_than_tab), re.M | re.X)
1164        return table_re.sub(self._table_sub, text)
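The column-alignment handling inside `_table_sub` can be shown in isolation. A hedged sketch: `aligns_from_underline` is an illustrative name, and this naive `split('|')` ignores the escaped-pipe handling the real code performs:

```python
# The underline row of a pipe table maps each column to an alignment
# based on leading/trailing colons, as in _table_sub above.
def aligns_from_underline(underline):
    cols = [c.strip() for c in underline.strip().strip('|').split('|')]
    aligns = {}
    for idx, col in enumerate(cols):
        if col.startswith(':') and col.endswith(':'):
            aligns[idx] = 'center'
        elif col.startswith(':'):
            aligns[idx] = 'left'
        elif col.endswith(':'):
            aligns[idx] = 'right'
        # No colons: no explicit alignment for this column.
    return aligns
```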
1165
1166    def _wiki_table_sub(self, match):
1167        ttext = match.group(0).strip()
1168        # print('wiki table: %r' % match.group(0))
1169        rows = []
1170        for line in ttext.splitlines(0):
1171            line = line.strip()[2:-2].strip()
1172            row = [c.strip() for c in re.split(r'(?<!\\)\|\|', line)]
1173            rows.append(row)
1174        # from pprint import pprint
1175        # pprint(rows)
1176        hlines = []
1177
1178        def add_hline(line, indents=0):
1179            hlines.append((self.tab * indents) + line)
1180
1181        def format_cell(text):
1182            return self._run_span_gamut(re.sub(r"^\s*~", "", text).strip(" "))
1183
1184        add_hline('<table%s>' % self._html_class_str_from_tag('table'))
1185        # Check if first cell of first row is a header cell. If so, assume the whole row is a header row.
1186        if rows and rows[0] and re.match(r"^\s*~", rows[0][0]):
1187            add_hline('<thead>', 1)
1188            add_hline('<tr>', 2)
1189            for cell in rows[0]:
1190                add_hline("<th>{}</th>".format(format_cell(cell)), 3)
1191            add_hline('</tr>', 2)
1192            add_hline('</thead>', 1)
1193            # Only one header row allowed.
1194            rows = rows[1:]
1195        # If no more rows, don't create a tbody.
1196        if rows:
1197            add_hline('<tbody>', 1)
1198            for row in rows:
1199                add_hline('<tr>', 2)
1200                for cell in row:
1201                    add_hline('<td>{}</td>'.format(format_cell(cell)), 3)
1202                add_hline('</tr>', 2)
1203            add_hline('</tbody>', 1)
1204        add_hline('</table>')
1205        return '\n'.join(hlines) + '\n'
1206
1207    def _do_wiki_tables(self, text):
1208        # Optimization.
1209        if "||" not in text:
1210            return text
1211
1212        less_than_tab = self.tab_width - 1
1213        wiki_table_re = re.compile(r'''
1214            (?:(?<=\n\n)|\A\n?)            # leading blank line
1215            ^([ ]{0,%d})\|\|.+?\|\|[ ]*\n  # first line
1216            (^\1\|\|.+?\|\|\n)*        # any number of subsequent lines
1217            ''' % less_than_tab, re.M | re.X)
1218        return wiki_table_re.sub(self._wiki_table_sub, text)
1219
1220    def _run_span_gamut(self, text):
1221        # These are all the transformations that occur *within* block-level
1222        # tags like paragraphs, headers, and list items.
1223
1224        text = self._do_code_spans(text)
1225
1226        text = self._escape_special_chars(text)
1227
1228        # Process anchor and image tags.
1229        if "link-patterns" in self.extras:
1230            text = self._do_link_patterns(text)
1231
1232        text = self._do_links(text)
1233
1234        # Make links out of things like `<http://example.com/>`
1235        # Must come after _do_links(), because you can use < and >
1236        # delimiters in inline links like [this](<url>).
1237        text = self._do_auto_links(text)
1238
1239        text = self._encode_amps_and_angles(text)
1240
1241        if "strike" in self.extras:
1242            text = self._do_strike(text)
1243
1244        if "underline" in self.extras:
1245            text = self._do_underline(text)
1246
1247        text = self._do_italics_and_bold(text)
1248
1249        if "smarty-pants" in self.extras:
1250            text = self._do_smart_punctuation(text)
1251
1252        # Do hard breaks:
1253        if "break-on-newline" in self.extras:
1254            text = re.sub(r" *\n(?!\<(?:\/?(ul|ol|li))\>)", "<br%s\n" % self.empty_element_suffix, text)
1255        else:
1256            text = re.sub(r" {2,}\n", " <br%s\n" % self.empty_element_suffix, text)
1257
1258        return text
1259
1260    # "Sorta" because auto-links are identified as "tag" tokens.
1261    _sorta_html_tokenize_re = re.compile(r"""
1262        (
1263            # tag
1264            </?
1265            (?:\w+)                                     # tag name
1266            (?:\s+(?:[\w-]+:)?[\w-]+=(?:".*?"|'.*?'))*  # attributes
1267            \s*/?>
1268            |
1269            # auto-link (e.g., <http://www.activestate.com/>)
1270            <[\w~:/?#\[\]@!$&'\(\)*+,;%=\.\\-]+>
1271            |
1272            <!--.*?-->      # comment
1273            |
1274            <\?.*?\?>       # processing instruction
1275        )
1276        """, re.X)
1277
1278    def _escape_special_chars(self, text):
1279        # Python markdown note: the HTML tokenization here differs from
1280        # that in Markdown.pl, hence the behaviour for subtle cases can
1281        # differ (I believe the tokenizer here does a better job because
1282        # it isn't susceptible to unmatched '<' and '>' in HTML tags).
1283        # Note, however, that '>' is not allowed in an auto-link URL
1284        # here.
1285        escaped = []
1286        is_html_markup = False
1287        for token in self._sorta_html_tokenize_re.split(text):
1288            if is_html_markup:
1289                # Within tags/HTML-comments/auto-links, encode * and _
1290                # so they don't conflict with their use in Markdown for
1291                # italics and strong.  We're replacing each such
1292                # character with its corresponding MD5 checksum value;
1293                # this is likely overkill, but it should prevent us from
1294                # colliding with the escape values by accident.
1295                escaped.append(token.replace('*', self._escape_table['*'])
1296                               .replace('_', self._escape_table['_']))
1297            else:
1298                escaped.append(self._encode_backslash_escapes(token))
1299            is_html_markup = not is_html_markup
1300        return ''.join(escaped)
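The `is_html_markup` toggle above relies on a property of `re.split` worth noting: when the pattern contains a capturing group, the result alternates between non-matching text and the captured matches, so a simple boolean flip tracks which kind of token is in hand. A small demonstration with a much-simplified stand-in pattern:

```python
import re

# Simplified stand-in for _sorta_html_tokenize_re: any <...> run.
tag_re = re.compile(r'(<[^>]+>)')

# With a capturing group, re.split interleaves plain text and tags:
# text, tag, text, tag, text, ...
tokens = tag_re.split('plain <b>bold</b> tail')
```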
1301
1302    def _hash_html_spans(self, text):
1303        # Used for safe_mode.
1304
1305        def _is_auto_link(s):
1306            if ':' in s and self._auto_link_re.match(s):
1307                return True
1308            elif '@' in s and self._auto_email_link_re.match(s):
1309                return True
1310            return False
1311
1312        def _is_code_span(index, token):
1313            try:
1314                if token == '<code>':
1315                    peek_tokens = split_tokens[index: index + 3]
1316                elif token == '</code>':
1317                    peek_tokens = split_tokens[index - 2: index + 1]
1318                else:
1319                    return False
1320            except IndexError:
1321                return False
1322
1323            return re.match(r'<code>md5-[A-Fa-f0-9]{32}</code>', ''.join(peek_tokens))
1324
1325        tokens = []
1326        split_tokens = self._sorta_html_tokenize_re.split(text)
1327        is_html_markup = False
1328        for index, token in enumerate(split_tokens):
1329            if is_html_markup and not _is_auto_link(token) and not _is_code_span(index, token):
1330                sanitized = self._sanitize_html(token)
1331                key = _hash_text(sanitized)
1332                self.html_spans[key] = sanitized
1333                tokens.append(key)
1334            else:
1335                tokens.append(self._encode_incomplete_tags(token))
1336            is_html_markup = not is_html_markup
1337        return ''.join(tokens)
1338
1339    def _unhash_html_spans(self, text):
1340        for key, sanitized in list(self.html_spans.items()):
1341            text = text.replace(key, sanitized)
1342        return text
1343
1344    def _sanitize_html(self, s):
1345        if self.safe_mode == "replace":
1346            return self.html_removed_text
1347        elif self.safe_mode == "escape":
1348            replacements = [
1349                ('&', '&amp;'),
1350                ('<', '&lt;'),
1351                ('>', '&gt;'),
1352            ]
1353            for before, after in replacements:
1354                s = s.replace(before, after)
1355            return s
1356        else:
1357            raise MarkdownError("invalid value for 'safe_mode': %r (must be "
1358                                "'escape' or 'replace')" % self.safe_mode)
1359
1360    _inline_link_title = re.compile(r'''
1361            (                   # \1
1362              [ \t]+
1363              (['"])            # quote char = \2
1364              (?P<title>.*?)
1365              \2
1366            )?                  # title is optional
1367          \)$
1368        ''', re.X | re.S)
1369    _tail_of_reference_link_re = re.compile(r'''
1370          # Match tail of: [text][id]
1371          [ ]?          # one optional space
1372          (?:\n[ ]*)?   # one optional newline followed by spaces
1373          \[
1374            (?P<id>.*?)
1375          \]
1376        ''', re.X | re.S)
1377
1378    _whitespace = re.compile(r'\s*')
1379
1380    _strip_anglebrackets = re.compile(r'<(.*)>.*')
1381
1382    def _find_non_whitespace(self, text, start):
1383        """Return the index of the first non-whitespace character in text
1384        at or after start.
1385        """
1386        match = self._whitespace.match(text, start)
1387        return match.end()
1388
1389    def _find_balanced(self, text, start, open_c, close_c):
1390        """Returns the index where the open_c and close_c characters balance
1391        out - the same number of open_c and close_c are encountered - or the
1392        end of string if it's reached before the balance point is found.
1393        """
1394        i = start
1395        l = len(text)
1396        count = 1
1397        while count > 0 and i < l:
1398            if text[i] == open_c:
1399                count += 1
1400            elif text[i] == close_c:
1401                count -= 1
1402            i += 1
1403        return i
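`_find_balanced` can be exercised standalone. This sketch copies the logic above; note that `count` starts at 1, so `start` must point just past an opening delimiter, and the return value is one past the matching close (or the end of string if the delimiters never balance):

```python
# Standalone copy of the balanced-delimiter scan in _find_balanced.
def find_balanced(text, start, open_c, close_c):
    i, count = start, 1
    while count > 0 and i < len(text):
        if text[i] == open_c:
            count += 1          # a nested opener
        elif text[i] == close_c:
            count -= 1          # close one level
        i += 1
    return i
```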
1404
1405    def _extract_url_and_title(self, text, start):
1406        """Extracts the url and (optional) title from the tail of a link"""
1407        # text[start] equals the opening parenthesis
1408        idx = self._find_non_whitespace(text, start + 1)
1409        if idx == len(text):
1410            return None, None, None
1411        end_idx = idx
1412        has_anglebrackets = text[idx] == "<"
1413        if has_anglebrackets:
1414            end_idx = self._find_balanced(text, end_idx + 1, "<", ">")
1415        end_idx = self._find_balanced(text, end_idx, "(", ")")
1416        match = self._inline_link_title.search(text, idx, end_idx)
1417        if not match:
1418            return None, None, None
1419        url, title = text[idx:match.start()], match.group("title")
1420        if has_anglebrackets:
1421            url = self._strip_anglebrackets.sub(r'\1', url)