pdoc.markdown2
A fast and complete Python implementation of Markdown.
[from http://daringfireball.net/projects/markdown/]
Markdown is a text-to-HTML filter; it translates an easy-to-read / easy-to-write structured text format into HTML. Markdown's text format is most similar to that of plain text email, and supports features such as headers, emphasis, code blocks, blockquotes, and links.
Markdown's syntax is designed not as a generic markup language, but specifically to serve as a front-end to (X)HTML. You can use span-level HTML tags anywhere in a Markdown document, and you can use block level HTML tags (like <div> and <table> as well).
Module usage:
    >>> import markdown2
    >>> markdown2.markdown("*boo!*")  # or use `html = markdown_path(PATH)`
    u'<p><em>boo!</em></p>\n'

    >>> markdowner = Markdown()
    >>> markdowner.convert("*boo!*")
    u'<p><em>boo!</em></p>\n'
    >>> markdowner.convert("**boom!**")
    u'<p><strong>boom!</strong></p>\n'
This implementation of Markdown implements the full "core" syntax plus a number of extras (e.g., code syntax coloring, footnotes) as described on https://github.com/trentm/python-markdown2/wiki/Extras.
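The extras mechanism accepts either a list of extra names or a dict mapping each name to a per-extra argument (most extras take no argument, so their value is None). The constructor normalizes both forms to a dict. A minimal standalone sketch of that normalization step — `normalize_extras` is an illustrative helper, not part of the module's API:

```python
def normalize_extras(extras):
    """Normalize an extras spec (None, list of names, or dict) to a dict,
    mirroring the massaging done in Markdown.__init__."""
    if extras is None:
        return {}
    if not isinstance(extras, dict):
        # A bare list of names becomes {name: None, ...}
        return dict((e, None) for e in extras)
    return dict(extras)

print(normalize_extras(["toc", "footnotes"]))
# {'toc': None, 'footnotes': None}
```

In the real API you would pass the spec straight through, e.g. `markdown2.markdown(text, extras=["fenced-code-blocks", "tables"])` or `extras={"toc": {"depth": 2}}` for extras that take arguments.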
# fmt: off
# flake8: noqa
# type: ignore
# Taken from here: https://github.com/trentm/python-markdown2/blob/ac5e7b956e9b8bc952039bfecb158ef1ddd7d422

#!/usr/bin/env python
# Copyright (c) 2012 Trent Mick.
# Copyright (c) 2007-2008 ActiveState Corp.
# License: MIT (http://www.opensource.org/licenses/mit-license.php)

r"""A fast and complete Python implementation of Markdown.

[from http://daringfireball.net/projects/markdown/]
> Markdown is a text-to-HTML filter; it translates an easy-to-read /
> easy-to-write structured text format into HTML. Markdown's text
> format is most similar to that of plain text email, and supports
> features such as headers, *emphasis*, code blocks, blockquotes, and
> links.
>
> Markdown's syntax is designed not as a generic markup language, but
> specifically to serve as a front-end to (X)HTML. You can use span-level
> HTML tags anywhere in a Markdown document, and you can use block level
> HTML tags (like <div> and <table> as well).

Module usage:

    >>> import markdown2
    >>> markdown2.markdown("*boo!*")  # or use `html = markdown_path(PATH)`
    u'<p><em>boo!</em></p>\n'

    >>> markdowner = Markdown()
    >>> markdowner.convert("*boo!*")
    u'<p><em>boo!</em></p>\n'
    >>> markdowner.convert("**boom!**")
    u'<p><strong>boom!</strong></p>\n'

This implementation of Markdown implements the full "core" syntax plus a
number of extras (e.g., code syntax coloring, footnotes) as described on
<https://github.com/trentm/python-markdown2/wiki/Extras>.
"""

cmdln_desc = """A fast and complete Python implementation of Markdown, a
text-to-HTML conversion tool for web writers.

Supported extra syntax options (see -x|--extras option below and
see <https://github.com/trentm/python-markdown2/wiki/Extras> for details):

* admonitions: Enable parsing of RST admonitions.
* break-on-newline: Replace single new line characters with <br> when True
* code-friendly: Disable _ and __ for em and strong.
* cuddled-lists: Allow lists to be cuddled to the preceding paragraph.
* fenced-code-blocks: Allows a code block to not have to be indented
  by fencing it with '```' on a line before and after. Based on
  <http://github.github.com/github-flavored-markdown/> with support for
  syntax highlighting.
* footnotes: Support footnotes as in use on daringfireball.net and
  implemented in other Markdown processors (tho not in Markdown.pl v1.0.1).
* header-ids: Adds "id" attributes to headers. The id value is a slug of
  the header text.
* highlightjs-lang: Allows specifying the language which used for syntax
  highlighting when using fenced-code-blocks and highlightjs.
* html-classes: Takes a dict mapping html tag names (lowercase) to a
  string to use for a "class" tag attribute. Currently only supports "img",
  "table", "pre" and "code" tags. Add an issue if you require this for other
  tags.
* link-patterns: Auto-link given regex patterns in text (e.g. bug number
  references, revision number references).
* markdown-in-html: Allow the use of `markdown="1"` in a block HTML tag to
  have markdown processing be done on its contents. Similar to
  <http://michelf.com/projects/php-markdown/extra/#markdown-attr> but with
  some limitations.
* metadata: Extract metadata from a leading '---'-fenced block.
  See <https://github.com/trentm/python-markdown2/issues/77> for details.
* nofollow: Add `rel="nofollow"` to add `<a>` tags with an href. See
  <http://en.wikipedia.org/wiki/Nofollow>.
* numbering: Support of generic counters. Non standard extension to
  allow sequential numbering of figures, tables, equations, exhibits etc.
* pyshell: Treats unindented Python interactive shell sessions as <code>
  blocks.
* smarty-pants: Replaces ' and " with curly quotation marks or curly
  apostrophes. Replaces --, ---, ..., and . . . with en dashes, em dashes,
  and ellipses.
* spoiler: A special kind of blockquote commonly hidden behind a
  click on SO. Syntax per <http://meta.stackexchange.com/a/72878>.
* strike: text inside of double tilde is ~~strikethrough~~
* tag-friendly: Requires atx style headers to have a space between the # and
  the header text. Useful for applications that require twitter style tags to
  pass through the parser.
* tables: Tables using the same format as GFM
  <https://help.github.com/articles/github-flavored-markdown#tables> and
  PHP-Markdown Extra <https://michelf.ca/projects/php-markdown/extra/#table>.
* toc: The returned HTML string gets a new "toc_html" attribute which is
  a Table of Contents for the document. (experimental)
* use-file-vars: Look for an Emacs-style markdown-extras file variable to turn
  on Extras.
* wiki-tables: Google Code Wiki-style tables. See
  <http://code.google.com/p/support/wiki/WikiSyntax#Tables>.
* xml: Passes one-liner processing instructions and namespaced XML tags.
"""

# Dev Notes:
# - Python's regex syntax doesn't have '\z', so I'm using '\Z'. I'm
#   not yet sure if there implications with this. Compare 'pydoc sre'
#   and 'perldoc perlre'.

__version_info__ = (2, 4, 4)
__version__ = '.'.join(map(str, __version_info__))
__author__ = "Trent Mick"

import sys
import re
import logging
from hashlib import sha256
import optparse
from random import random, randint
import codecs
from collections import defaultdict

# ---- globals

DEBUG = False
log = logging.getLogger("markdown")

DEFAULT_TAB_WIDTH = 4

SECRET_SALT = bytes(randint(0, 1000000))


# MD5 function was previously used for this; the "md5" prefix was kept for
# backwards compatibility.
def _hash_text(s):
    return 'md5-' + sha256(SECRET_SALT + s.encode("utf-8")).hexdigest()[32:]


# Table of hash values for escaped characters:
g_escape_table = dict([(ch, _hash_text(ch))
                       for ch in '\\`*_{}[]()>#+-.!'])

# Ampersand-encoding based entirely on Nat Irons's Amputator MT plugin:
# http://bumppo.net/projects/amputator/
_AMPERSAND_RE = re.compile(r'&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)')


# ---- exceptions
class MarkdownError(Exception):
    pass


# ---- public api

def markdown_path(path, encoding="utf-8",
                  html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
                  safe_mode=None, extras=None, link_patterns=None,
                  footnote_title=None, footnote_return_symbol=None,
                  use_file_vars=False):
    fp = codecs.open(path, 'r', encoding)
    text = fp.read()
    fp.close()
    return Markdown(html4tags=html4tags, tab_width=tab_width,
                    safe_mode=safe_mode, extras=extras,
                    link_patterns=link_patterns,
                    footnote_title=footnote_title,
                    footnote_return_symbol=footnote_return_symbol,
                    use_file_vars=use_file_vars).convert(text)


def markdown(text, html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
             safe_mode=None, extras=None, link_patterns=None,
             footnote_title=None, footnote_return_symbol=None,
             use_file_vars=False, cli=False):
    return Markdown(html4tags=html4tags, tab_width=tab_width,
                    safe_mode=safe_mode, extras=extras,
                    link_patterns=link_patterns,
                    footnote_title=footnote_title,
                    footnote_return_symbol=footnote_return_symbol,
                    use_file_vars=use_file_vars, cli=cli).convert(text)


class Markdown(object):
    # The dict of "extras" to enable in processing -- a mapping of
    # extra name to argument for the extra. Most extras do not have an
    # argument, in which case the value is None.
    #
    # This can be set via (a) subclassing and (b) the constructor
    # "extras" argument.
    extras = None

    urls = None
    titles = None
    html_blocks = None
    html_spans = None
    html_removed_text = "{(#HTML#)}"  # placeholder removed text that does not trigger bold
    html_removed_text_compat = "[HTML_REMOVED]"  # for compat with markdown.py

    _toc = None

    # Used to track when we're inside an ordered or unordered list
    # (see _ProcessListItems() for details):
    list_level = 0

    _ws_only_line_re = re.compile(r"^[ \t]+$", re.M)

    def __init__(self, html4tags=False, tab_width=4, safe_mode=None,
                 extras=None, link_patterns=None,
                 footnote_title=None, footnote_return_symbol=None,
                 use_file_vars=False, cli=False):
        if html4tags:
            self.empty_element_suffix = ">"
        else:
            self.empty_element_suffix = " />"
        self.tab_width = tab_width
        self.tab = tab_width * " "

        # For compatibility with earlier markdown2.py and with
        # markdown.py's safe_mode being a boolean,
        #   safe_mode == True -> "replace"
        if safe_mode is True:
            self.safe_mode = "replace"
        else:
            self.safe_mode = safe_mode

        # Massaging and building the "extras" info.
        if self.extras is None:
            self.extras = {}
        elif not isinstance(self.extras, dict):
            self.extras = dict([(e, None) for e in self.extras])
        if extras:
            if not isinstance(extras, dict):
                extras = dict([(e, None) for e in extras])
            self.extras.update(extras)
        assert isinstance(self.extras, dict)

        if "toc" in self.extras:
            if "header-ids" not in self.extras:
                self.extras["header-ids"] = None  # "toc" implies "header-ids"

            if self.extras["toc"] is None:
                self._toc_depth = 6
            else:
                self._toc_depth = self.extras["toc"].get("depth", 6)
        self._instance_extras = self.extras.copy()

        if 'link-patterns' in self.extras:
            if link_patterns is None:
                # if you have specified that the link-patterns extra SHOULD
                # be used (via self.extras) but you haven't provided anything
                # via the link_patterns argument then an error is raised
                raise MarkdownError("If the 'link-patterns' extra is used, an argument for 'link_patterns' is required")
        self.link_patterns = link_patterns
        self.footnote_title = footnote_title
        self.footnote_return_symbol = footnote_return_symbol
        self.use_file_vars = use_file_vars
        self._outdent_re = re.compile(r'^(\t|[ ]{1,%d})' % tab_width, re.M)
        self.cli = cli

        self._escape_table = g_escape_table.copy()
        self._code_table = {}
        if "smarty-pants" in self.extras:
            self._escape_table['"'] = _hash_text('"')
            self._escape_table["'"] = _hash_text("'")

    def reset(self):
        self.urls = {}
        self.titles = {}
        self.html_blocks = {}
        self.html_spans = {}
        self.list_level = 0
        self.extras = self._instance_extras.copy()
        self._setup_extras()
        self._toc = None

    def _setup_extras(self):
        if "footnotes" in self.extras:
            self.footnotes = {}
            self.footnote_ids = []
        if "header-ids" in self.extras:
            self._count_from_header_id = defaultdict(int)
        if "metadata" in self.extras:
            self.metadata = {}

    # Per <https://developer.mozilla.org/en-US/docs/HTML/Element/a> "rel"
    # should only be used in <a> tags with an "href" attribute.

    # Opens the linked document in a new window or tab
    # should only used in <a> tags with an "href" attribute.
    # same with _a_nofollow
    _a_nofollow_or_blank_links = re.compile(r"""
        <(a)
        (
            [^>]*
            href=   # href is required
            ['"]?   # HTML5 attribute values do not have to be quoted
            [^#'"]  # We don't want to match href values that start with # (like footnotes)
        )
        """,
        re.IGNORECASE | re.VERBOSE
    )

    def convert(self, text):
        """Convert the given text."""
        # Main function. The order in which other subs are called here is
        # essential. Link and image substitutions need to happen before
        # _EscapeSpecialChars(), so that any *'s or _'s in the <a>
        # and <img> tags get encoded.

        # Clear the global hashes. If we don't clear these, you get conflicts
        # from other articles when generating a page which contains more than
        # one article (e.g. an index page that shows the N most recent
        # articles):
        self.reset()

        if not isinstance(text, str):
            # TODO: perhaps shouldn't presume UTF-8 for string input?
            text = str(text, 'utf-8')

        if self.use_file_vars:
            # Look for emacs-style file variable hints.
            text = self._emacs_oneliner_vars_pat.sub(self._emacs_vars_oneliner_sub, text)
            emacs_vars = self._get_emacs_vars(text)
            if "markdown-extras" in emacs_vars:
                splitter = re.compile("[ ,]+")
                for e in splitter.split(emacs_vars["markdown-extras"]):
                    if '=' in e:
                        ename, earg = e.split('=', 1)
                        try:
                            earg = int(earg)
                        except ValueError:
                            pass
                    else:
                        ename, earg = e, None
                    self.extras[ename] = earg

        self._setup_extras()

        # Standardize line endings:
        text = text.replace("\r\n", "\n")
        text = text.replace("\r", "\n")

        # Make sure $text ends with a couple of newlines:
        text += "\n\n"

        # Convert all tabs to spaces.
        text = self._detab(text)

        # Strip any lines consisting only of spaces and tabs.
        # This makes subsequent regexen easier to write, because we can
        # match consecutive blank lines with /\n+/ instead of something
        # contorted like /[ \t]*\n+/ .
        text = self._ws_only_line_re.sub("", text)

        # strip metadata from head and extract
        if "metadata" in self.extras:
            text = self._extract_metadata(text)

        text = self.preprocess(text)

        if "fenced-code-blocks" in self.extras and not self.safe_mode:
            text = self._do_fenced_code_blocks(text)

        if self.safe_mode:
            text = self._hash_html_spans(text)

        # Turn block-level HTML blocks into hash entries
        text = self._hash_html_blocks(text, raw=True)

        if "fenced-code-blocks" in self.extras and self.safe_mode:
            text = self._do_fenced_code_blocks(text)

        if 'admonitions' in self.extras:
            text = self._do_admonitions(text)

        # Because numbering references aren't links (yet?) then we can do everything associated with counters
        # before we get started
        if "numbering" in self.extras:
            text = self._do_numbering(text)

        # Strip link definitions, store in hashes.
        if "footnotes" in self.extras:
            # Must do footnotes first because an unlucky footnote defn
            # looks like a link defn:
            #   [^4]: this "looks like a link defn"
            text = self._strip_footnote_definitions(text)
        text = self._strip_link_definitions(text)

        text = self._run_block_gamut(text)

        if "footnotes" in self.extras:
            text = self._add_footnotes(text)

        text = self.postprocess(text)

        text = self._unescape_special_chars(text)

        if self.safe_mode:
            text = self._unhash_html_spans(text)
            # return the removed text warning to its markdown.py compatible form
            text = text.replace(self.html_removed_text, self.html_removed_text_compat)

        do_target_blank_links = "target-blank-links" in self.extras
        do_nofollow_links = "nofollow" in self.extras

        if do_target_blank_links and do_nofollow_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow noopener" target="_blank"\2', text)
        elif do_target_blank_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="noopener" target="_blank"\2', text)
        elif do_nofollow_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow"\2', text)

        if "toc" in self.extras and self._toc:
            self._toc_html = calculate_toc_html(self._toc)

            # Prepend toc html to output
            if self.cli:
                text = '{}\n{}'.format(self._toc_html, text)

        text += "\n"

        # Attach attrs to output
        rv = UnicodeWithAttrs(text)

        if "toc" in self.extras and self._toc:
            rv.toc_html = self._toc_html

        if "metadata" in self.extras:
            rv.metadata = self.metadata
        return rv

    def postprocess(self, text):
        """A hook for subclasses to do some postprocessing of the html, if
        desired. This is called before unescaping of special chars and
        unhashing of raw HTML spans.
        """
        return text

    def preprocess(self, text):
        """A hook for subclasses to do some preprocessing of the Markdown, if
        desired. This is called after basic formatting of the text, but prior
        to any extras, safe mode, etc. processing.
        """
        return text

    # Is metadata if the content starts with optional '---'-fenced `key: value`
    # pairs. E.g. (indented for presentation):
    #     ---
    #     foo: bar
    #     another-var: blah blah
    #     ---
    #     # header
    # or:
    #     foo: bar
    #     another-var: blah blah
    #
    #     # header
    _meta_data_pattern = re.compile(r'''
        ^(?:---[\ \t]*\n)?(  # optional opening fence
            (?:
                [\S \t]*\w[\S \t]*\s*:(?:\n+[ \t]+.*)+  # indented lists
            )|(?:
                (?:[\S \t]*\w[\S \t]*\s*:\s+>(?:\n\s+.*)+?)  # multiline long descriptions
                (?=\n[\S \t]*\w[\S \t]*\s*:\s*.*\n|\s*\Z)  # match up until the start of the next key:value definition or the end of the input text
            )|(?:
                [\S \t]*\w[\S \t]*\s*:(?! >).*\n?  # simple key:value pair, leading spaces allowed
            )
        )(?:---[\ \t]*\n)?  # optional closing fence
        ''', re.MULTILINE | re.VERBOSE
    )

    _key_val_list_pat = re.compile(
        r"^-(?:[ \t]*([^\n]*)(?:[ \t]*[:-][ \t]*(\S+))?)(?:\n((?:[ \t]+[^\n]+\n?)+))?",
        re.MULTILINE,
    )
    _key_val_dict_pat = re.compile(
        r"^([^:\n]+)[ \t]*:[ \t]*([^\n]*)(?:((?:\n[ \t]+[^\n]+)+))?", re.MULTILINE
    )  # grp0: key, grp1: value, grp2: multiline value
    _meta_data_fence_pattern = re.compile(r'^---[\ \t]*\n', re.MULTILINE)
    _meta_data_newline = re.compile("^\n", re.MULTILINE)

    def _extract_metadata(self, text):
        if text.startswith("---"):
            fence_splits = re.split(self._meta_data_fence_pattern, text, maxsplit=2)
            metadata_content = fence_splits[1]
            match = re.findall(self._meta_data_pattern, metadata_content)
            if not match:
                return text
            tail = fence_splits[2]
        else:
            metadata_split = re.split(self._meta_data_newline, text, maxsplit=1)
            metadata_content = metadata_split[0]
            match = re.findall(self._meta_data_pattern, metadata_content)
            if not match:
                return text
            tail = metadata_split[1]

        def parse_structured_value(value):
            vs = value.lstrip()
            vs = value.replace(v[: len(value) - len(vs)], "\n")[1:]

            # List
            if vs.startswith("-"):
                r = []
                for match in re.findall(self._key_val_list_pat, vs):
                    if match[0] and not match[1] and not match[2]:
                        r.append(match[0].strip())
                    elif match[0] == ">" and not match[1] and match[2]:
                        r.append(match[2].strip())
                    elif match[0] and match[1]:
                        r.append({match[0].strip(): match[1].strip()})
                    elif not match[0] and not match[1] and match[2]:
                        r.append(parse_structured_value(match[2]))
                    else:
                        # Broken case
                        pass

                return r

            # Dict
            else:
                return {
                    match[0].strip(): (
                        match[1].strip()
                        if match[1]
                        else parse_structured_value(match[2])
                    )
                    for match in re.findall(self._key_val_dict_pat, vs)
                }

        for item in match:

            k, v = item.split(":", 1)

            # Multiline value
            if v[:3] == " >\n":
                self.metadata[k.strip()] = _dedent(v[3:]).strip()

            # Empty value
            elif v == "\n":
                self.metadata[k.strip()] = ""

            # Structured value
            elif v[0] == "\n":
                self.metadata[k.strip()] = parse_structured_value(v)

            # Simple value
            else:
                self.metadata[k.strip()] = v.strip()

        return tail

    _emacs_oneliner_vars_pat = re.compile(r"((?:<!--)?\s*-\*-)\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?(-\*-\s*(?:-->)?)",
                                          re.UNICODE)
    # This regular expression is intended to match blocks like this:
    #    PREFIX Local Variables: SUFFIX
    #    PREFIX mode: Tcl SUFFIX
    #    PREFIX End: SUFFIX
    # Some notes:
    # - "[ \t]" is used instead of "\s" to specifically exclude newlines
    # - "(\r\n|\n|\r)" is used instead of "$" because the sre engine does
    #   not like anything other than Unix-style line terminators.
    _emacs_local_vars_pat = re.compile(r"""^
        (?P<prefix>(?:[^\r\n|\n|\r])*?)
        [\ \t]*Local\ Variables:[\ \t]*
        (?P<suffix>.*?)(?:\r\n|\n|\r)
        (?P<content>.*?\1End:)
        """, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

    def _emacs_vars_oneliner_sub(self, match):
        if match.group(1).strip() == '-*-' and match.group(4).strip() == '-*-':
            lead_ws = re.findall(r'^\s*', match.group(1))[0]
            tail_ws = re.findall(r'\s*$', match.group(4))[0]
            return '%s<!-- %s %s %s -->%s' % (lead_ws, '-*-', match.group(2).strip(), '-*-', tail_ws)

        start, end = match.span()
        return match.string[start: end]

    def _get_emacs_vars(self, text):
        """Return a dictionary of emacs-style local variables.

        Parsing is done loosely according to this spec (and according to
        some in-practice deviations from this):
        http://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html#Specifying-File-Variables
        """
        emacs_vars = {}
        SIZE = pow(2, 13)  # 8kB

        # Search near the start for a '-*-'-style one-liner of variables.
        head = text[:SIZE]
        if "-*-" in head:
            match = self._emacs_oneliner_vars_pat.search(head)
            if match:
                emacs_vars_str = match.group(2)
                assert '\n' not in emacs_vars_str
                emacs_var_strs = [s.strip() for s in emacs_vars_str.split(';')
                                  if s.strip()]
                if len(emacs_var_strs) == 1 and ':' not in emacs_var_strs[0]:
                    # While not in the spec, this form is allowed by emacs:
                    #   -*- Tcl -*-
                    # where the implied "variable" is "mode". This form
                    # is only allowed if there are no other variables.
                    emacs_vars["mode"] = emacs_var_strs[0].strip()
                else:
                    for emacs_var_str in emacs_var_strs:
                        try:
                            variable, value = emacs_var_str.strip().split(':', 1)
                        except ValueError:
                            log.debug("emacs variables error: malformed -*- "
                                      "line: %r", emacs_var_str)
                            continue
                        # Lowercase the variable name because Emacs allows "Mode"
                        # or "mode" or "MoDe", etc.
                        emacs_vars[variable.lower()] = value.strip()

        tail = text[-SIZE:]
        if "Local Variables" in tail:
            match = self._emacs_local_vars_pat.search(tail)
            if match:
                prefix = match.group("prefix")
                suffix = match.group("suffix")
                lines = match.group("content").splitlines(0)
                # print "prefix=%r, suffix=%r, content=%r, lines: %s"\
                #      % (prefix, suffix, match.group("content"), lines)

                # Validate the Local Variables block: proper prefix and suffix
                # usage.
                for i, line in enumerate(lines):
                    if not line.startswith(prefix):
                        log.debug("emacs variables error: line '%s' "
                                  "does not use proper prefix '%s'"
                                  % (line, prefix))
                        return {}
                    # Don't validate suffix on last line. Emacs doesn't care,
                    # neither should we.
                    if i != len(lines) - 1 and not line.endswith(suffix):
                        log.debug("emacs variables error: line '%s' "
                                  "does not use proper suffix '%s'"
                                  % (line, suffix))
                        return {}

                # Parse out one emacs var per line.
                continued_for = None
                for line in lines[:-1]:  # no var on the last line ("PREFIX End:")
                    if prefix: line = line[len(prefix):]  # strip prefix
                    if suffix: line = line[:-len(suffix)]  # strip suffix
                    line = line.strip()
                    if continued_for:
                        variable = continued_for
                        if line.endswith('\\'):
                            line = line[:-1].rstrip()
                        else:
                            continued_for = None
                        emacs_vars[variable] += ' ' + line
                    else:
                        try:
                            variable, value = line.split(':', 1)
                        except ValueError:
                            log.debug("local variables error: missing colon "
                                      "in local variables entry: '%s'" % line)
                            continue
                        # Do NOT lowercase the variable name, because Emacs only
                        # allows "mode" (and not "Mode", "MoDe", etc.) in this block.
                        value = value.strip()
                        if value.endswith('\\'):
                            value = value[:-1].rstrip()
                            continued_for = variable
                        else:
                            continued_for = None
                        emacs_vars[variable] = value

        # Unquote values.
        for var, val in list(emacs_vars.items()):
            if len(val) > 1 and (val.startswith('"') and val.endswith('"')
                                 or val.startswith('"') and val.endswith('"')):
                emacs_vars[var] = val[1:-1]

        return emacs_vars

    def _detab_line(self, line):
        r"""Recursively convert tabs to spaces in a single line.

        Called from _detab()."""
        if '\t' not in line:
            return line
        chunk1, chunk2 = line.split('\t', 1)
        chunk1 += (' ' * (self.tab_width - len(chunk1) % self.tab_width))
        output = chunk1 + chunk2
        return self._detab_line(output)

    def _detab(self, text):
        r"""Iterate text line by line and convert tabs to spaces.

            >>> m = Markdown()
            >>> m._detab("\tfoo")
            '    foo'
            >>> m._detab("  \tfoo")
            '    foo'
            >>> m._detab("\t  foo")
            '      foo'
            >>> m._detab("  foo")
            '  foo'
            >>> m._detab("  foo\n\tbar\tblam")
            '  foo\n    bar blam'
        """
        if '\t' not in text:
            return text
        output = []
        for line in text.splitlines():
            output.append(self._detab_line(line))
        return '\n'.join(output)

    # I broke out the html5 tags here and add them to _block_tags_a and
    # _block_tags_b. This way html5 tags are easy to keep track of.
    _html5tags = '|article|aside|header|hgroup|footer|nav|section|figure|figcaption'

    _block_tags_a = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del'
    _block_tags_a += _html5tags

    _strict_tag_block_re = re.compile(r"""
        (                       # save in \1
            ^                   # start of line  (with re.M)
            <(%s)               # start tag = \2
            \b                  # word break
            (.*\n)*?            # any number of lines, minimally matching
            </\2>               # the matching end tag
            [ \t]*              # trailing spaces/tabs
            (?=\n+|\Z)          # followed by a newline or end of document
        )
        """ % _block_tags_a,
        re.X | re.M)

    _block_tags_b = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math'
    _block_tags_b += _html5tags

    _liberal_tag_block_re = re.compile(r"""
        (                       # save in \1
            ^                   # start of line  (with re.M)
            <(%s)               # start tag = \2
            \b                  # word break
            (.*\n)*?            # any number of lines, minimally matching
            .*</\2>             # the matching end tag
            [ \t]*              # trailing spaces/tabs
            (?=\n+|\Z)          # followed by a newline or end of document
        )
        """ % _block_tags_b,
        re.X | re.M)

    _html_markdown_attr_re = re.compile(
        r'''\s+markdown=("1"|'1')''')

    def _hash_html_block_sub(self, match, raw=False):
        html = match.group(1)
        if raw and self.safe_mode:
            html = self._sanitize_html(html)
        elif 'markdown-in-html' in self.extras and 'markdown=' in html:
            first_line = html.split('\n', 1)[0]
            m = self._html_markdown_attr_re.search(first_line)
            if m:
                lines = html.split('\n')
                middle = '\n'.join(lines[1:-1])
                last_line = lines[-1]
                first_line = first_line[:m.start()] + first_line[m.end():]
                f_key = _hash_text(first_line)
                self.html_blocks[f_key] = first_line
                l_key = _hash_text(last_line)
                self.html_blocks[l_key] = last_line
                return ''.join(["\n\n", f_key,
                                "\n\n", middle, "\n\n",
                                l_key, "\n\n"])
        key = _hash_text(html)
        self.html_blocks[key] = html
        return "\n\n" + key + "\n\n"

    def _hash_html_blocks(self, text, raw=False):
        """Hashify HTML blocks

        We only want to do this for block-level HTML tags, such as headers,
        lists, and tables. That's because we still want to wrap <p>s around
        "paragraphs" that are wrapped in non-block-level tags, such as anchors,
        phrase emphasis, and spans. The list of tags we're looking for is
        hard-coded.

        @param raw {boolean} indicates if these are raw HTML blocks in
            the original source. It makes a difference in "safe" mode.
        """
        if '<' not in text:
            return text

        # Pass `raw` value into our calls to self._hash_html_block_sub.
        hash_html_block_sub = _curry(self._hash_html_block_sub, raw=raw)

        # First, look for nested blocks, e.g.:
        #   <div>
        #       <div>
        #       tags for inner block must be indented.
        #     </div>
        #   </div>
        #
        # The outermost tags must start at the left margin for this to match, and
        # the inner nested divs must be indented.
        # We need to do this before the next, more liberal match, because the next
        # match will start at the first `<div>` and stop at the first `</div>`.
        text = self._strict_tag_block_re.sub(hash_html_block_sub, text)

        # Now match more liberally, simply from `\n<tag>` to `</tag>\n`
        text = self._liberal_tag_block_re.sub(hash_html_block_sub, text)

        # Special case just for <hr />. It was easier to make a special
        # case than to make the other regex more complicated.
        if "<hr" in text:
            _hr_tag_re = _hr_tag_re_from_tab_width(self.tab_width)
            text = _hr_tag_re.sub(hash_html_block_sub, text)

        # Special case for standalone HTML comments:
        if "<!--" in text:
            start = 0
            while True:
                # Delimiters for next comment block.
                try:
                    start_idx = text.index("<!--", start)
                except ValueError:
                    break
                try:
                    end_idx = text.index("-->", start_idx) + 3
                except ValueError:
                    break

                # Start position for next comment block search.
                start = end_idx

                # Validate whitespace before comment.
                if start_idx:
                    # - Up to `tab_width - 1` spaces before start_idx.
                    for i in range(self.tab_width - 1):
                        if text[start_idx - 1] != ' ':
                            break
                        start_idx -= 1
                        if start_idx == 0:
                            break
                    # - Must be preceded by 2 newlines or hit the start of
                    #   the document.
                    if start_idx == 0:
                        pass
                    elif start_idx == 1 and text[0] == '\n':
                        start_idx = 0  # to match minute detail of Markdown.pl regex
                    elif text[start_idx - 2:start_idx] == '\n\n':
                        pass
                    else:
                        break

                # Validate whitespace after comment.
                # - Any number of spaces and tabs.
                while end_idx < len(text):
                    if text[end_idx] not in ' \t':
                        break
                    end_idx += 1
                # - Must be followed by 2 newlines or hit end of text.
                if text[end_idx:end_idx + 2] not in ('', '\n', '\n\n'):
                    continue

                # Escape and hash (must match `_hash_html_block_sub`).
                html = text[start_idx:end_idx]
                if raw and self.safe_mode:
                    html = self._sanitize_html(html)
                key = _hash_text(html)
                self.html_blocks[key] = html
                text = text[:start_idx] + "\n\n" + key + "\n\n" + text[end_idx:]

        if "xml" in self.extras:
            # Treat XML processing instructions and namespaced one-liner
            # tags as if they were block HTML tags. E.g., if standalone
            # (i.e. are their own paragraph), the following do not get
            # wrapped in a <p> tag:
            #    <?foo bar?>
            #
            #    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chapter_1.md"/>
            _xml_oneliner_re = _xml_oneliner_re_from_tab_width(self.tab_width)
            text = _xml_oneliner_re.sub(hash_html_block_sub, text)

        return text

    def _strip_link_definitions(self, text):
        # Strips link definitions from text, stores the URLs and titles in
        # hash references.
        less_than_tab = self.tab_width - 1

        # Link defs are in the form:
        #   [id]: url "optional title"
        _link_def_re = re.compile(r"""
            ^[ ]{0,%d}\[(.+)\]: # id = \1
              [ \t]*
              \n?               # maybe *one* newline
              [ \t]*
            <?(.+?)>?           # url = \2
              [ \t]*
            (?:
                \n?             # maybe one newline
                [ \t]*
                (?<=\s)         # lookbehind for whitespace
                ['"(]
                ([^\n]*)        # title = \3
                ['")]
                [ \t]*
            )?                  # title is optional
            (?:\n+|\Z)
            """ % less_than_tab, re.X | re.M | re.U)
        return _link_def_re.sub(self._extract_link_def_sub, text)

    def _extract_link_def_sub(self, match):
        id, url, title = match.groups()
        key = id.lower()  # Link IDs are case-insensitive
        self.urls[key] = self._encode_amps_and_angles(url)
        if title:
            self.titles[key] = title
        return ""

    def _do_numbering(self, text):
        ''' We handle the special extension for generic numbering for
            tables, figures etc.
        '''
        # First pass to define all the references
        self.regex_defns = re.compile(r'''
            \[\#(\w+)       # the counter. Open square plus hash plus a word \1
            ([^@]*)         # Some optional characters, that aren't an @. \2
            @(\w+)          # the id. Should this be normed? \3
            ([^\]]*)\]      # The rest of the text up to the terminating ] \4
            ''', re.VERBOSE)
        self.regex_subs = re.compile(r"\[@(\w+)\s*\]")  # [@ref_id]
        counters = {}
        references = {}
        replacements = []
        definition_html = '<figcaption class="{}" id="counter-ref-{}">{}{}{}</figcaption>'
        reference_html = '<a class="{}" href="#counter-ref-{}">{}</a>'
        for match in self.regex_defns.finditer(text):
            # We must have four match groups otherwise this isn't a numbering reference
            if len(match.groups()) != 4:
                continue
            counter = match.group(1)
            text_before = match.group(2).strip()
            ref_id = match.group(3)
            text_after = match.group(4)
            number = counters.get(counter, 1)
            references[ref_id] = (number, counter)
            replacements.append((match.start(0),
                                 definition_html.format(counter,
                                                        ref_id,
                                                        text_before,
                                                        number,
                                                        text_after),
                                 match.end(0)))
            counters[counter] = number + 1
        for repl in reversed(replacements):
            text = text[:repl[0]] + repl[1] + text[repl[2]:]

        # Second pass to replace the references with the right
        # value of the counter
        # Fwiw, it's vaguely annoying to have to turn the iterator into
        # a list and then reverse it but I can't think of a better thing to do.
        for match in reversed(list(self.regex_subs.finditer(text))):
            number, counter = references.get(match.group(1), (None, None))
            if number is not None:
                repl = reference_html.format(counter,
                                             match.group(1),
                                             number)
            else:
                repl = reference_html.format(match.group(1),
                                             'countererror',
                                             '?' + match.group(1) + '?')
            if "smarty-pants" in self.extras:
                repl = repl.replace('"', self._escape_table['"'])

            text = text[:match.start()] + repl + text[match.end():]
        return text

    def _extract_footnote_def_sub(self, match):
        id, text = match.groups()
        text = _dedent(text, skip_first_line=not text.startswith('\n')).strip()
        normed_id = re.sub(r'\W', '-', id)
        # Ensure footnote text ends with a couple newlines (for some
        # block gamut matches).
        self.footnotes[normed_id] = text + "\n\n"
        return ""

    def _strip_footnote_definitions(self, text):
        """A footnote definition looks like this:

            [^note-id]: Text of the note.

                May include one or more indented paragraphs.

        Where,
        - The 'note-id' can be pretty much anything, though typically it
          is the number of the footnote.
        - The first paragraph may start on the next line, like so:

            [^note-id]:
                Text of the note.
        """
        less_than_tab = self.tab_width - 1
        footnote_def_re = re.compile(r'''
            ^[ ]{0,%d}\[\^(.+)\]:  # id = \1
            [ \t]*
            (                      # footnote text = \2
              # First line need not start with the spaces.
              (?:\s*.*\n+)
              (?:
                (?:[ ]{%d} | \t)   # Subsequent lines must be indented.
                .*\n+
              )*
            )
            # Lookahead for non-space at line-start, or end of doc.
            (?:(?=^[ ]{0,%d}\S)|\Z)
            ''' % (less_than_tab, self.tab_width, self.tab_width),
            re.X | re.M)
        return footnote_def_re.sub(self._extract_footnote_def_sub, text)

    _hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)

    def _run_block_gamut(self, text):
        # These are all the transformations that form block-level
        # tags like paragraphs, headers, and list items.
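As a standalone check (outside the class), the `_hr_re` horizontal-rule pattern defined above can be exercised directly with the `re` module:

```python
import re

# Copy of the `_hr_re` pattern above: up to three leading spaces, then at
# least three of the same rule character (-, _ or *), with at most two
# spaces between repeats.
hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)

print(bool(hr_re.match("---")))    # three hyphens form a rule
print(bool(hr_re.match("* * *")))  # spaced asterisks do too
print(bool(hr_re.match("--")))     # two characters are not enough
```

This mirrors the "one or two spaces between hr chars" limit of Markdown.pl 1.0.1 that the comment in `_run_block_gamut` describes.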

        if 'admonitions' in self.extras:
            text = self._do_admonitions(text)

        if "fenced-code-blocks" in self.extras:
            text = self._do_fenced_code_blocks(text)

        text = self._do_headers(text)

        # Do Horizontal Rules:
        # On the number of spaces in horizontal rules: The spec is fuzzy: "If
        # you wish, you may use spaces between the hyphens or asterisks."
        # Markdown.pl 1.0.1's hr regexes limit the number of spaces between the
        # hr chars to one or two. We'll reproduce that limit here.
        hr = "\n<hr" + self.empty_element_suffix + "\n"
        text = re.sub(self._hr_re, hr, text)

        text = self._do_lists(text)

        if "pyshell" in self.extras:
            text = self._prepare_pyshell_blocks(text)
        if "wiki-tables" in self.extras:
            text = self._do_wiki_tables(text)
        if "tables" in self.extras:
            text = self._do_tables(text)

        text = self._do_code_blocks(text)

        text = self._do_block_quotes(text)

        # We already ran _HashHTMLBlocks() before, in Markdown(), but that
        # was to escape raw HTML in the original Markdown source. This time,
        # we're escaping the markup we've just created, so that we don't wrap
        # <p> tags around block-level tags.
        text = self._hash_html_blocks(text)

        text = self._form_paragraphs(text)

        return text

    def _pyshell_block_sub(self, match):
        if "fenced-code-blocks" in self.extras:
            dedented = _dedent(match.group(0))
            return self._do_fenced_code_blocks("```pycon\n" + dedented + "```\n")
        lines = match.group(0).splitlines(0)
        _dedentlines(lines)
        indent = ' ' * self.tab_width
        s = ('\n'  # separate from possible cuddled paragraph
             + indent + ('\n' + indent).join(lines)
             + '\n')
        return s

    def _prepare_pyshell_blocks(self, text):
        """Ensure that Python interactive shell sessions are put in
        code blocks -- even if not properly indented.
        """
        if ">>>" not in text:
            return text

        less_than_tab = self.tab_width - 1
        _pyshell_block_re = re.compile(r"""
            ^([ ]{0,%d})>>>[ ].*\n  # first line
            ^(\1[^\S\n]*\S.*\n)*    # any number of subsequent lines with at least one character
            (?=^\1?\n|\Z)           # ends with a blank line or end of document
            """ % less_than_tab, re.M | re.X)

        return _pyshell_block_re.sub(self._pyshell_block_sub, text)

    def _table_sub(self, match):
        trim_space_re = '^[ \t\n]+|[ \t\n]+$'
        trim_bar_re = r'^\||\|$'
        split_bar_re = r'^\||(?<![\`\\])\|'
        escape_bar_re = r'\\\|'

        head, underline, body = match.groups()

        # Determine aligns for columns.
        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
                re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", underline)))]
        align_from_col_idx = {}
        for col_idx, col in enumerate(cols):
            if col[0] == ':' and col[-1] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:center;"'
            elif col[0] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:left;"'
            elif col[-1] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:right;"'

        # thead
        hlines = ['<table%s>' % self._html_class_str_from_tag('table'), '<thead>', '<tr>']
        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
                re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", head)))]
        for col_idx, col in enumerate(cols):
            hlines.append(' <th%s>%s</th>' % (
                align_from_col_idx.get(col_idx, ''),
                self._run_span_gamut(col)
            ))
        hlines.append('</tr>')
        hlines.append('</thead>')

        # tbody
        hlines.append('<tbody>')
        for line in body.strip('\n').split('\n'):
            hlines.append('<tr>')
            cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in
                    re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", line)))]
            for col_idx, col in enumerate(cols):
                hlines.append(' <td%s>%s</td>' % (
                    align_from_col_idx.get(col_idx, ''),
                    self._run_span_gamut(col)
                ))
            hlines.append('</tr>')
        hlines.append('</tbody>')
        hlines.append('</table>')

        return '\n'.join(hlines) + '\n'

    def _do_tables(self, text):
        """Copying PHP-Markdown and GFM table syntax. Some regex borrowed from
        https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L2538
        """
        less_than_tab = self.tab_width - 1
        table_re = re.compile(r'''
                (?:(?<=\n\n)|\A\n?)         # leading blank line

                ^[ ]{0,%d}                  # allowed whitespace
                (.*[|].*) \n                # $1: header row (at least one pipe)

                ^[ ]{0,%d}                  # allowed whitespace
                (                           # $2: underline row
                    # underline row with leading bar
                    (?: \|\ *:?-+:?\ * )+ \|? \s? \n
                    |
                    # or, underline row without leading bar
                    (?: \ *:?-+:?\ *\| )+ (?: \ *:?-+:?\ * )? \s? \n
                )

                (                           # $3: data rows
                    (?:
                        ^[ ]{0,%d}(?!\ )    # ensure line begins with 0 to less_than_tab spaces
                        .*\|.* \n
                    )+
                )
            ''' % (less_than_tab, less_than_tab, less_than_tab), re.M | re.X)
        return table_re.sub(self._table_sub, text)

    def _wiki_table_sub(self, match):
        ttext = match.group(0).strip()
        # print('wiki table: %r' % match.group(0))
        rows = []
        for line in ttext.splitlines(0):
            line = line.strip()[2:-2].strip()
            row = [c.strip() for c in re.split(r'(?<!\\)\|\|', line)]
            rows.append(row)
        # from pprint import pprint
        # pprint(rows)
        hlines = []

        def add_hline(line, indents=0):
            hlines.append((self.tab * indents) + line)

        def format_cell(text):
            return self._run_span_gamut(re.sub(r"^\s*~", "", text).strip(" "))

        add_hline('<table%s>' % self._html_class_str_from_tag('table'))
        # Check if first cell of first row is a header cell. If so, assume the whole row is a header row.
        if rows and rows[0] and re.match(r"^\s*~", rows[0][0]):
            add_hline('<thead>', 1)
            add_hline('<tr>', 2)
            for cell in rows[0]:
                add_hline("<th>{}</th>".format(format_cell(cell)), 3)
            add_hline('</tr>', 2)
            add_hline('</thead>', 1)
            # Only one header row allowed.
            rows = rows[1:]
        # If no more rows, don't create a tbody.
        if rows:
            add_hline('<tbody>', 1)
            for row in rows:
                add_hline('<tr>', 2)
                for cell in row:
                    add_hline('<td>{}</td>'.format(format_cell(cell)), 3)
                add_hline('</tr>', 2)
            add_hline('</tbody>', 1)
        add_hline('</table>')
        return '\n'.join(hlines) + '\n'

    def _do_wiki_tables(self, text):
        # Optimization.
        if "||" not in text:
            return text

        less_than_tab = self.tab_width - 1
        wiki_table_re = re.compile(r'''
            (?:(?<=\n\n)|\A\n?)            # leading blank line
            ^([ ]{0,%d})\|\|.+?\|\|[ ]*\n  # first line
            (^\1\|\|.+?\|\|\n)*            # any number of subsequent lines
            ''' % less_than_tab, re.M | re.X)
        return wiki_table_re.sub(self._wiki_table_sub, text)

    def _run_span_gamut(self, text):
        # These are all the transformations that occur *within* block-level
        # tags like paragraphs, headers, and list items.

        text = self._do_code_spans(text)

        text = self._escape_special_chars(text)

        # Process anchor and image tags.
        if "link-patterns" in self.extras:
            text = self._do_link_patterns(text)

        text = self._do_links(text)

        # Make links out of things like `<http://example.com/>`
        # Must come after _do_links(), because you can use < and >
        # delimiters in inline links like [this](<url>).
        text = self._do_auto_links(text)

        text = self._encode_amps_and_angles(text)

        if "strike" in self.extras:
            text = self._do_strike(text)

        if "underline" in self.extras:
            text = self._do_underline(text)

        text = self._do_italics_and_bold(text)

        if "smarty-pants" in self.extras:
            text = self._do_smart_punctuation(text)

        # Do hard breaks:
        if "break-on-newline" in self.extras:
            text = re.sub(r" *\n(?!\<(?:\/?(ul|ol|li))\>)", "<br%s\n" % self.empty_element_suffix, text)
        else:
            text = re.sub(r" {2,}\n", " <br%s\n" % self.empty_element_suffix, text)

        return text

    # "Sorta" because auto-links are identified as "tag" tokens.
    _sorta_html_tokenize_re = re.compile(r"""
        (
            # tag
            </?
            (?:\w+)                                     # tag name
            (?:\s+(?:[\w-]+:)?[\w-]+=(?:".*?"|'.*?'))*  # attributes
            \s*/?>
            |
            # auto-link (e.g., <http://www.activestate.com/>)
            <[\w~:/?#\[\]@!$&'\(\)*+,;%=\.\\-]+>
            |
            <!--.*?-->      # comment
            |
            <\?.*?\?>       # processing instruction
        )
        """, re.X)

    def _escape_special_chars(self, text):
        # Python markdown note: the HTML tokenization here differs from
        # that in Markdown.pl, hence the behaviour for subtle cases can
        # differ (I believe the tokenizer here does a better job because
        # it isn't susceptible to unmatched '<' and '>' in HTML tags).
        # Note, however, that '>' is not allowed in an auto-link URL
        # here.
        escaped = []
        is_html_markup = False
        for token in self._sorta_html_tokenize_re.split(text):
            if is_html_markup:
                # Within tags/HTML-comments/auto-links, encode * and _
                # so they don't conflict with their use in Markdown for
                # italics and strong. We're replacing each such
                # character with its corresponding MD5 checksum value;
                # this is likely overkill, but it should prevent us from
                # colliding with the escape values by accident.
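As a standalone aside (not part of the module), the tokenizing trick used by `_escape_special_chars` relies on a documented property of `re.split`: when the pattern contains a single capturing group, the matched markup is kept in the result, so text and markup tokens alternate starting with text. That is why the loop can simply flip `is_html_markup` on each token. A toy stand-in for `_sorta_html_tokenize_re` shows the shape:

```python
import re

# A much-simplified stand-in for `_sorta_html_tokenize_re`: one capturing
# group, so re.split() keeps each matched tag in the result list.
toy_tag_re = re.compile(r'(</?\w+>)')

tokens = toy_tag_re.split('a *b* <em>c</em> d')
print(tokens)  # text and tag tokens alternate, starting with text
```

Here `tokens[0::2]` is plain text to backslash-escape and `tokens[1::2]` is markup whose `*` and `_` must be protected.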
                escaped.append(token.replace('*', self._escape_table['*'])
                                    .replace('_', self._escape_table['_']))
            else:
                escaped.append(self._encode_backslash_escapes(token))
            is_html_markup = not is_html_markup
        return ''.join(escaped)

    def _hash_html_spans(self, text):
        # Used for safe_mode.

        def _is_auto_link(s):
            if ':' in s and self._auto_link_re.match(s):
                return True
            elif '@' in s and self._auto_email_link_re.match(s):
                return True
            return False

        def _is_code_span(index, token):
            try:
                if token == '<code>':
                    peek_tokens = split_tokens[index: index + 3]
                elif token == '</code>':
                    peek_tokens = split_tokens[index - 2: index + 1]
                else:
                    return False
            except IndexError:
                return False

            return re.match(r'<code>md5-[A-Fa-f0-9]{32}</code>', ''.join(peek_tokens))

        tokens = []
        split_tokens = self._sorta_html_tokenize_re.split(text)
        is_html_markup = False
        for index, token in enumerate(split_tokens):
            if is_html_markup and not _is_auto_link(token) and not _is_code_span(index, token):
                sanitized = self._sanitize_html(token)
                key = _hash_text(sanitized)
                self.html_spans[key] = sanitized
                tokens.append(key)
            else:
                tokens.append(self._encode_incomplete_tags(token))
            is_html_markup = not is_html_markup
        return ''.join(tokens)

    def _unhash_html_spans(self, text):
        for key, sanitized in list(self.html_spans.items()):
            text = text.replace(key, sanitized)
        return text

    def _sanitize_html(self, s):
        if self.safe_mode == "replace":
            return self.html_removed_text
        elif self.safe_mode == "escape":
            replacements = [
                ('&', '&amp;'),
                ('<', '&lt;'),
                ('>', '&gt;'),
            ]
            for before, after in replacements:
                s = s.replace(before, after)
            return s
        else:
            raise MarkdownError("invalid value for 'safe_mode': %r (must be "
                                "'escape' or 'replace')" % self.safe_mode)

    _inline_link_title = re.compile(r'''
        (                   # \1
          [ \t]+
          (['"])            # quote char = \2
          (?P<title>.*?)
          \2
        )?                  # title is optional
        \)$
        ''', re.X | re.S)
    _tail_of_reference_link_re = re.compile(r'''
        # Match tail of: [text][id]
        [ ]?            # one optional space
        (?:\n[ ]*)?     # one optional newline followed by spaces
        \[
            (?P<id>.*?)
        \]
        ''', re.X | re.S)

    _whitespace = re.compile(r'\s*')

    _strip_anglebrackets = re.compile(r'<(.*)>.*')

    def _find_non_whitespace(self, text, start):
        """Returns the index of the first non-whitespace character in text
        after (and including) start
        """
        match = self._whitespace.match(text, start)
        return match.end()

    def _find_balanced(self, text, start, open_c, close_c):
        """Returns the index where the open_c and close_c characters balance
        out - the same number of open_c and close_c are encountered - or the
        end of string if it's reached before the balance point is found.
        """
        i = start
        l = len(text)
        count = 1
        while count > 0 and i < l:
            if text[i] == open_c:
                count += 1
            elif text[i] == close_c:
                count -= 1
            i += 1
        return i

    def _extract_url_and_title(self, text, start):
        """Extracts the url and (optional) title from the tail of a link"""
        # text[start] equals the opening parenthesis
        idx = self._find_non_whitespace(text, start + 1)
        if idx == len(text):
            return None, None, None
        end_idx = idx
        has_anglebrackets = text[idx] == "<"
        if has_anglebrackets:
            end_idx = self._find_balanced(text, end_idx + 1, "<", ">")
        end_idx = self._find_balanced(text, end_idx, "(", ")")
        match = self._inline_link_title.search(text, idx, end_idx)
        if not match:
            return None, None, None
        url, title = text[idx:match.start()], match.group("title")
        if has_anglebrackets:
            url = self._strip_anglebrackets.sub(r'\1', url)
        return url, title, end_idx

    _safe_protocols = re.compile(r'(https?|ftp):', re.I)

    def _do_links(self, text):
        """Turn Markdown link shortcuts into XHTML <a> and <img> tags.

        This is a combination of Markdown.pl's _DoAnchors() and
        _DoImages(). They are done together because that simplified the
        approach. It was necessary to use a different approach than
        Markdown.pl because of the lack of atomic matching support in
        Python's regex engine used in $g_nested_brackets.
        """
        MAX_LINK_TEXT_SENTINEL = 3000  # markdown2 issue 24

        # `anchor_allowed_pos` is used to support img links inside
        # anchors, but not anchors inside anchors. An anchor's start
        # pos must be `>= anchor_allowed_pos`.
        anchor_allowed_pos = 0

        curr_pos = 0
        while True:  # Handle the next link.
            # The next '[' is the start of:
            # - an inline anchor:   [text](url "title")
            # - a reference anchor: [text][id]
            # - an inline img:      ![text](url "title")
            # - a reference img:    ![text][id]
            # - a footnote ref:     [^id]
            #   (Only if 'footnotes' extra enabled)
            # - a footnote defn:    [^id]: ...
            #   (Only if 'footnotes' extra enabled) These have already
            #   been stripped in _strip_footnote_definitions() so no
            #   need to watch for them.
            # - a link definition:  [id]: url "title"
            #   These have already been stripped in
            #   _strip_link_definitions() so no need to watch for them.
            # - not markup:         [...anything else...
            try:
                start_idx = text.index('[', curr_pos)
            except ValueError:
                break
            text_length = len(text)

            # Find the matching closing ']'.
            # Markdown.pl allows *matching* brackets in link text so we
            # will here too. Markdown.pl *doesn't* currently allow
            # matching brackets in img alt text -- we'll differ in that
            # regard.
            bracket_depth = 0
            for p in range(start_idx + 1, min(start_idx + MAX_LINK_TEXT_SENTINEL,
                                              text_length)):
                ch = text[p]
                if ch == ']':
                    bracket_depth -= 1
                    if bracket_depth < 0:
                        break
                elif ch == '[':
                    bracket_depth += 1
            else:
                # Closing bracket not found within sentinel length.
                # This isn't markup.
                curr_pos = start_idx + 1
                continue
            link_text = text[start_idx + 1:p]

            # Fix for issue 341 - Injecting XSS into link text
            if self.safe_mode:
                link_text = self._hash_html_spans(link_text)
                link_text = self._unhash_html_spans(link_text)

            # Possibly a footnote ref?
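The footnote-id normalization used in the branch below (`re.sub(r'\W', '-', ...)`, the same one-liner as in `_extract_footnote_def_sub`) can be checked in isolation; this is a standalone sketch with a hypothetical helper name:

```python
import re

def norm_footnote_id(raw_id):
    # Mirror of the normalization below: every non-word character becomes
    # a hyphen, so the id is safe to embed in id="fnref-..." attributes.
    return re.sub(r'\W', '-', raw_id)

print(norm_footnote_id("note 1"))  # 'note-1'
print(norm_footnote_id("a.b:c"))   # 'a-b-c'
```

Because the same normalization runs on both the definition and the reference, the two sides always agree on the key used in `self.footnotes`.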
            if "footnotes" in self.extras and link_text.startswith("^"):
                normed_id = re.sub(r'\W', '-', link_text[1:])
                if normed_id in self.footnotes:
                    self.footnote_ids.append(normed_id)
                    result = '<sup class="footnote-ref" id="fnref-%s">' \
                             '<a href="#fn-%s">%s</a></sup>' \
                             % (normed_id, normed_id, len(self.footnote_ids))
                    text = text[:start_idx] + result + text[p + 1:]
                else:
                    # This id isn't defined, leave the markup alone.
                    curr_pos = p + 1
                continue

            # Now determine what this is by the remainder.
            p += 1

            # Inline anchor or img?
            if text[p:p + 1] == '(':  # attempt at perf improvement
                url, title, url_end_idx = self._extract_url_and_title(text, p)
                if url is not None:
                    # Handle an inline anchor or img.
                    is_img = start_idx > 0 and text[start_idx - 1] == "!"
                    if is_img:
                        start_idx -= 1

                    # We've got to encode these to avoid conflicting
                    # with italics/bold.
                    url = url.replace('*', self._escape_table['*']) \
                             .replace('_', self._escape_table['_'])
                    if title:
                        title_str = ' title="%s"' % (
                            _xml_escape_attr(title)
                            .replace('*', self._escape_table['*'])
                            .replace('_', self._escape_table['_']))
                    else:
                        title_str = ''
                    if is_img:
                        img_class_str = self._html_class_str_from_tag("img")
                        result = '<img src="%s" alt="%s"%s%s%s' \
                            % (_html_escape_url(url, safe_mode=self.safe_mode),
                               _xml_escape_attr(link_text),
                               title_str,
                               img_class_str,
                               self.empty_element_suffix)
                        if "smarty-pants" in self.extras:
                            result = result.replace('"', self._escape_table['"'])
                        curr_pos = start_idx + len(result)
                        text = text[:start_idx] + result + text[url_end_idx:]
                    elif start_idx >= anchor_allowed_pos:
                        safe_link = self._safe_protocols.match(url) or url.startswith('#')
                        if self.safe_mode and not safe_link:
                            result_head = '<a href="#"%s>' % (title_str)
                        else:
                            result_head = '<a href="%s"%s>' % (
                                _html_escape_url(url, safe_mode=self.safe_mode), title_str)
                        result = '%s%s</a>' % (result_head, link_text)
                        if "smarty-pants" in self.extras:
                            result = result.replace('"', self._escape_table['"'])
                        # <img> allowed from curr_pos on, <a> from
                        # anchor_allowed_pos on.
                        curr_pos = start_idx + len(result_head)
                        anchor_allowed_pos = start_idx + len(result)
                        text = text[:start_idx] + result + text[url_end_idx:]
                    else:
                        # Anchor not allowed here.
                        curr_pos = start_idx + 1
                    continue

            # Reference anchor or img?
            else:
                match = self._tail_of_reference_link_re.match(text, p)
                if match:
                    # Handle a reference-style anchor or img.
                    is_img = start_idx > 0 and text[start_idx - 1] == "!"
                    if is_img:
                        start_idx -= 1
                    link_id = match.group("id").lower()
                    if not link_id:
                        link_id = link_text.lower()  # for links like [this][]
                    if link_id in self.urls:
                        url = self.urls[link_id]
                        # We've got to encode these to avoid conflicting
                        # with italics/bold.
                        url = url.replace('*', self._escape_table['*']) \
                                 .replace('_', self._escape_table['_'])
                        title = self.titles.get(link_id)
                        if title:
                            title = _xml_escape_attr(title) \
                                .replace('*', self._escape_table['*']) \
                                .replace('_', self._escape_table['_'])
                            title_str = ' title="%s"' % title
                        else:
                            title_str = ''
                        if is_img:
                            img_class_str = self._html_class_str_from_tag("img")
                            result = '<img src="%s" alt="%s"%s%s%s' \
                                % (_html_escape_url(url, safe_mode=self.safe_mode),
                                   _xml_escape_attr(link_text),
                                   title_str,
                                   img_class_str,
                                   self.empty_element_suffix)
                            if "smarty-pants" in self.extras:
                                result = result.replace('"', self._escape_table['"'])
                            curr_pos = start_idx + len(result)
                            text = text[:start_idx] + result + text[match.end():]
                        elif start_idx >= anchor_allowed_pos:
                            if self.safe_mode and not self._safe_protocols.match(url):
                                result_head = '<a href="#"%s>' % (title_str)
                            else:
                                result_head = '<a href="%s"%s>' % (
                                    _html_escape_url(url, safe_mode=self.safe_mode), title_str)
                            result = '%s%s</a>' % (result_head, link_text)
                            if "smarty-pants" in self.extras:
                                result = result.replace('"', self._escape_table['"'])
                            # <img> allowed from curr_pos on, <a> from
                            # anchor_allowed_pos on.
                            curr_pos = start_idx + len(result_head)
                            anchor_allowed_pos = start_idx + len(result)
                            text = text[:start_idx] + result + text[match.end():]
                        else:
                            # Anchor not allowed here.
                            curr_pos = start_idx + 1
                    else:
                        # This id isn't defined, leave the markup alone.
                        curr_pos = match.end()
                    continue

            # Otherwise, it isn't markup.
            curr_pos = start_idx + 1

        return text

    def header_id_from_text(self, text, prefix, n):
        """Generate a header id attribute value from the given header
        HTML content.

        This is only called if the "header-ids" extra is enabled.
        Subclasses may override this for different header ids.

        @param text {str} The text of the header tag
        @param prefix {str} The requested prefix for header ids. This is the
            value of the "header-ids" extra key, if any. Otherwise, None.
        @param n {int} The <hN> tag number, i.e. `1` for an <h1> tag.
        @returns {str} The value for the header tag's "id" attribute. Return
            None to not have an id attribute and to exclude this header from
            the TOC (if the "toc" extra is specified).
        """
        header_id = _slugify(text)
        if prefix and isinstance(prefix, str):
            header_id = prefix + '-' + header_id

        self._count_from_header_id[header_id] += 1
        if 0 == len(header_id) or self._count_from_header_id[header_id] > 1:
            header_id += '-%s' % self._count_from_header_id[header_id]

        return header_id

    def _toc_add_entry(self, level, id, name):
        if level > self._toc_depth:
            return
        if self._toc is None:
            self._toc = []
        self._toc.append((level, id, self._unescape_special_chars(name)))

    _h_re_base = r'''
        (^(.+)[ \t]{0,99}\n(=+|-+)[ \t]*\n+)
        |
        (^(\#{1,6})     # \1 = string of #'s
        [ \t]%s
        (.+?)           # \2 = Header text
        [ \t]{0,99}
        (?<!\\)         # ensure not an escaped trailing '#'
        \#*             # optional closing #'s (not counted)
        \n+
        )
        '''

    _h_re = re.compile(_h_re_base % '*', re.X | re.M)
    _h_re_tag_friendly = re.compile(_h_re_base % '+', re.X | re.M)

    def _h_sub(self, match):
        if match.group(1) is not None and match.group(3) == "-":
            return match.group(1)
        elif match.group(1) is not None:
            # Setext header
            n = {"=": 1, "-": 2}[match.group(3)[0]]
            header_group = match.group(2)
        else:
            # atx header
            n = len(match.group(5))
            header_group = match.group(6)

        demote_headers = self.extras.get("demote-headers")
        if demote_headers:
            n = min(n + demote_headers, 6)
        header_id_attr = ""
        if "header-ids" in self.extras:
            header_id = self.header_id_from_text(header_group,
                self.extras["header-ids"], n)
            if header_id:
                header_id_attr = ' id="%s"' % header_id
        html = self._run_span_gamut(header_group)
        if "toc" in self.extras and header_id:
            self._toc_add_entry(n, header_id, html)
        return "<h%d%s>%s</h%d>\n\n" % (n, header_id_attr, html, n)

    def _do_headers(self, text):
        # Setext-style headers:
        #     Header 1
        #     ========
        #
        #     Header 2
        #     --------

        # atx-style headers:
        #   # Header 1
        #   ## Header 2
        #   ## Header 2 with closing hashes ##
        #   ...
        #   ###### Header 6

        if 'tag-friendly' in self.extras:
            return self._h_re_tag_friendly.sub(self._h_sub, text)
        return self._h_re.sub(self._h_sub, text)

    _marker_ul_chars = '*+-'
    _marker_any = r'(?:[%s]|\d+\.)' % _marker_ul_chars
    _marker_ul = '(?:[%s])' % _marker_ul_chars
    _marker_ol = r'(?:\d+\.)'

    def _list_sub(self, match):
        lst = match.group(1)
        lst_type = match.group(3) in self._marker_ul_chars and "ul" or "ol"
        result = self._process_list_items(lst)
        if self.list_level:
            return "<%s>\n%s</%s>\n" % (lst_type, result, lst_type)
        else:
            return "<%s>\n%s</%s>\n\n" % (lst_type, result, lst_type)

    def _do_lists(self, text):
        # Form HTML ordered (numbered) and unordered (bulleted) lists.

        # Iterate over each *non-overlapping* list match.
        pos = 0
        while True:
            # Find the *first* hit for either list style (ul or ol). We
            # match ul and ol separately to avoid adjacent lists of different
            # types running into each other (see issue #16).
            hits = []
            for marker_pat in (self._marker_ul, self._marker_ol):
                less_than_tab = self.tab_width - 1
                whole_list = r'''
                (                   # \1 = whole list
                  (                 # \2
                    [ ]{0,%d}
                    (%s)            # \3 = first list item marker
                    [ \t]+
                    (?!\ *\3\ )     # '- - - ...' isn't a list. See 'not_quite_a_list' test case.
                  )
                  (?:.+?)
                  (                 # \4
                      \Z
                    |
                      \n{2,}
                      (?=\S)
                      (?!           # Negative lookahead for another list item marker
                        [ \t]*
                        %s[ \t]+
                      )
                  )
                )
                ''' % (less_than_tab, marker_pat, marker_pat)
                if self.list_level:  # sub-list
                    list_re = re.compile("^" + whole_list, re.X | re.M | re.S)
                else:
                    list_re = re.compile(r"(?:(?<=\n\n)|\A\n?)" + whole_list,
                                         re.X | re.M | re.S)
                match = list_re.search(text, pos)
                if match:
                    hits.append((match.start(), match))
            if not hits:
                break
            hits.sort()
            match = hits[0][1]
            start, end = match.span()
            middle = self._list_sub(match)
            text = text[:start] + middle + text[end:]
            pos = start + len(middle)  # start pos for next attempted match

        return text

    _list_item_re = re.compile(r'''
        (\n)?                   # leading line = \1
        (^[ \t]*)               # leading whitespace = \2
        (?P<marker>%s) [ \t]+   # list marker = \3
        ((?:.+?)                # list item text = \4
        (\n{1,2}))              # eols = \5
        (?= \n* (\Z | \2 (?P<next_marker>%s) [ \t]+))
        ''' % (_marker_any, _marker_any),
        re.M | re.X | re.S)

    _task_list_item_re = re.compile(r'''
        (\[[\ xX]\])[ \t]+      # tasklist marker = \1
        (.*)                    # list item text = \2
        ''', re.M | re.X | re.S)

    _task_list_warpper_str = r'<input type="checkbox" class="task-list-item-checkbox" %sdisabled> %s'

    def _task_list_item_sub(self, match):
        marker = match.group(1)
        item_text = match.group(2)
        if marker in ['[x]', '[X]']:
            return self._task_list_warpper_str % ('checked ', item_text)
        elif marker == '[ ]':
            return self._task_list_warpper_str % ('', item_text)

    _last_li_endswith_two_eols = False

    def _list_item_sub(self, match):
        item = match.group(4)
        leading_line = match.group(1)
        if leading_line or "\n\n" in item or self._last_li_endswith_two_eols:
            item = self._run_block_gamut(self._outdent(item))
        else:
            # Recursion for sub-lists:
            item = self._do_lists(self._uniform_outdent(item, min_outdent=' ')[1])
            if item.endswith('\n'):
                item = item[:-1]
            item = self._run_span_gamut(item)
        self._last_li_endswith_two_eols = (len(match.group(5)) == 2)

        if "task_list" in self.extras:
            item = self._task_list_item_re.sub(self._task_list_item_sub, item)

        return "<li>%s</li>\n" % item

    def _process_list_items(self, list_str):
        # Process the contents of a single ordered or unordered list,
        # splitting it into individual list items.

        # The $g_list_level global keeps track of when we're inside a list.
        # Each time we enter a list, we increment it; when we leave a list,
        # we decrement. If it's zero, we're not in a list anymore.
        #
        # We do this because when we're not inside a list, we want to treat
        # something like this:
        #
        #       I recommend upgrading to version
        #       8. Oops, now this line is treated
        #       as a sub-list.
        #
        # As a single paragraph, despite the fact that the second line starts
        # with a digit-period-space sequence.
        #
        # Whereas when we're inside a list (or sub-list), that line will be
        # treated as the start of a sub-list. What a kludge, huh? This is
        # an aspect of Markdown's syntax that's hard to parse perfectly
        # without resorting to mind-reading. Perhaps the solution is to
        # change the syntax rules such that sub-lists must start with a
        # starting cardinal number; e.g. "1." or "a.".
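The "8. Oops" ambiguity described in the comment above comes straight from the marker pattern: any digit-dot-space prefix looks like an ordered-list marker. A standalone sketch of `_marker_any` (copied out of the class) shows why the wrapped line is mistakable for a sub-list item:

```python
import re

# Copy of the `_marker_any` pattern above, anchored and followed by the
# [ \t]+ that `_list_item_re` requires after a marker.
marker_any = re.compile(r'^(?:[*+-]|\d+\.)[ \t]+')

print(bool(marker_any.match("8. Oops, now this line is a sub-list")))  # True
print(bool(marker_any.match("version 8 works fine")))                  # False
```

So only the `list_level` counter, not the text itself, tells the parser whether such a line starts a sub-list or continues a paragraph.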
1852 self.list_level += 1 1853 self._last_li_endswith_two_eols = False 1854 list_str = list_str.rstrip('\n') + '\n' 1855 list_str = self._list_item_re.sub(self._list_item_sub, list_str) 1856 self.list_level -= 1 1857 return list_str 1858 1859 def _get_pygments_lexer(self, lexer_name): 1860 try: 1861 from pygments import lexers, util 1862 except ImportError: 1863 return None 1864 try: 1865 return lexers.get_lexer_by_name(lexer_name) 1866 except util.ClassNotFound: 1867 return None 1868 1869 def _color_with_pygments(self, codeblock, lexer, **formatter_opts): 1870 import pygments 1871 import pygments.formatters 1872 1873 class HtmlCodeFormatter(pygments.formatters.HtmlFormatter): 1874 def _wrap_code(self, inner): 1875 """A function for use in a Pygments Formatter which 1876 wraps in <code> tags. 1877 """ 1878 yield 0, "<code>" 1879 for tup in inner: 1880 yield tup 1881 yield 0, "</code>" 1882 1883 def _add_newline(self, inner): 1884 # Add newlines around the inner contents so that _strict_tag_block_re matches the outer div. 
1885 yield 0, "\n" 1886 yield from inner 1887 yield 0, "\n" 1888 1889 def wrap(self, source, outfile=None): 1890 """Return the source with a code, pre, and div.""" 1891 if outfile is None: 1892 # pygments >= 2.12 1893 return self._add_newline(self._wrap_pre(self._wrap_code(source))) 1894 else: 1895 # pygments < 2.12 1896 return self._wrap_div(self._add_newline(self._wrap_pre(self._wrap_code(source)))) 1897 1898 formatter_opts.setdefault("cssclass", "codehilite") 1899 formatter = HtmlCodeFormatter(**formatter_opts) 1900 return pygments.highlight(codeblock, lexer, formatter) 1901 1902 def _code_block_sub(self, match, is_fenced_code_block=False): 1903 lexer_name = None 1904 if is_fenced_code_block: 1905 lexer_name = match.group(2) 1906 codeblock = match.group(3) 1907 codeblock = codeblock[:-1] # drop one trailing newline 1908 else: 1909 codeblock = match.group(1) 1910 codeblock = self._outdent(codeblock) 1911 codeblock = self._detab(codeblock) 1912 codeblock = codeblock.lstrip('\n') # trim leading newlines 1913 codeblock = codeblock.rstrip() # trim trailing whitespace 1914 1915 # Note: "code-color" extra is DEPRECATED. 1916 if "code-color" in self.extras and codeblock.startswith(":::"): 1917 lexer_name, rest = codeblock.split('\n', 1) 1918 lexer_name = lexer_name[3:].strip() 1919 codeblock = rest.lstrip("\n") # Remove lexer declaration line. 
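The (deprecated) "code-color" convention handled just above can be sketched in isolation: a code block whose first line starts with `:::` names the Pygments lexer, and that line is stripped before highlighting. The sample strings are hypothetical:

```python
# Split off a ":::<lexer>" declaration line, as _code_block_sub does.
codeblock = ":::python\nprint('hi')"
lexer_name, rest = codeblock.split('\n', 1)
lexer_name = lexer_name[3:].strip()   # drop the ":::" prefix -> "python"
codeblock = rest.lstrip('\n')         # remove the lexer declaration line

assert lexer_name == "python"
assert codeblock == "print('hi')"
```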

        # Use pygments only if not using the highlightjs-lang extra
        if lexer_name and "highlightjs-lang" not in self.extras:
            lexer = self._get_pygments_lexer(lexer_name)
            if lexer:
                leading_indent = ' ' * (len(match.group(1)) - len(match.group(1).lstrip()))
                return self._code_block_with_lexer_sub(codeblock, leading_indent, lexer, is_fenced_code_block)

        pre_class_str = self._html_class_str_from_tag("pre")

        if "highlightjs-lang" in self.extras and lexer_name:
            code_class_str = ' class="%s language-%s"' % (lexer_name, lexer_name)
        else:
            code_class_str = self._html_class_str_from_tag("code")

        if is_fenced_code_block:
            # Fenced code blocks need to be outdented before encoding, and then reapplied
            leading_indent = ' ' * (len(match.group(1)) - len(match.group(1).lstrip()))
            leading_indent, codeblock = self._uniform_outdent_limit(codeblock, leading_indent)

            codeblock = self._encode_code(codeblock)

            return "\n%s<pre%s><code%s>%s\n</code></pre>\n" % (
                leading_indent, pre_class_str, code_class_str, codeblock)
        else:
            codeblock = self._encode_code(codeblock)

            return "\n<pre%s><code%s>%s\n</code></pre>\n" % (
                pre_class_str, code_class_str, codeblock)

    def _code_block_with_lexer_sub(self, codeblock, leading_indent, lexer, is_fenced_code_block):
        if is_fenced_code_block:
            formatter_opts = self.extras['fenced-code-blocks'] or {}
        else:
            formatter_opts = self.extras['code-color'] or {}

        def unhash_code(codeblock):
            for key, sanitized in list(self.html_spans.items()):
                codeblock = codeblock.replace(key, sanitized)
            replacements = [
                ("&amp;", "&"),
                ("&lt;", "<"),
                ("&gt;", ">")
            ]
            for old, new in replacements:
                codeblock = codeblock.replace(old, new)
            return codeblock

        # remove leading indent from code block
        leading_indent, codeblock = self._uniform_outdent(codeblock)

codeblock = unhash_code(codeblock) 1972 colored = self._color_with_pygments(codeblock, lexer, 1973 **formatter_opts) 1974 1975 # add back the indent to all lines 1976 return "\n%s\n" % self._uniform_indent(colored, leading_indent, True) 1977 1978 def _html_class_str_from_tag(self, tag): 1979 """Get the appropriate ' class="..."' string (note the leading 1980 space), if any, for the given tag. 1981 """ 1982 if "html-classes" not in self.extras: 1983 return "" 1984 try: 1985 html_classes_from_tag = self.extras["html-classes"] 1986 except TypeError: 1987 return "" 1988 else: 1989 if isinstance(html_classes_from_tag, dict): 1990 if tag in html_classes_from_tag: 1991 return ' class="%s"' % html_classes_from_tag[tag] 1992 return "" 1993 1994 def _do_code_blocks(self, text): 1995 """Process Markdown `<pre><code>` blocks.""" 1996 code_block_re = re.compile(r''' 1997 (?:\n\n|\A\n?) 1998 ( # $1 = the code block -- one or more lines, starting with a space/tab 1999 (?: 2000 (?:[ ]{%d} | \t) # Lines must start with a tab or a tab-width of spaces 2001 .*\n+ 2002 )+ 2003 ) 2004 ((?=^[ ]{0,%d}\S)|\Z) # Lookahead for non-space at line-start, or end of doc 2005 # Lookahead to make sure this block isn't already in a code block. 2006 # Needed when syntax highlighting is being used. 2007 (?!([^<]|<(/?)span)*\</code\>) 2008 ''' % (self.tab_width, self.tab_width), 2009 re.M | re.X) 2010 return code_block_re.sub(self._code_block_sub, text) 2011 2012 _fenced_code_block_re = re.compile(r''' 2013 (?:\n+|\A\n?|(?<=\n)) 2014 (^[ \t]*`{3,})\s{0,99}?([\w+-]+)?\s{0,99}?\n # $1 = opening fence (captured for back-referencing), $2 = optional lang 2015 (.*?) 
                                                # $3 = code block content
        \1[ \t]*\n                              # closing fence
        ''', re.M | re.X | re.S)

    def _fenced_code_block_sub(self, match):
        return self._code_block_sub(match, is_fenced_code_block=True)

    def _do_fenced_code_blocks(self, text):
        """Process ```-fenced unindented code blocks ('fenced-code-blocks' extra)."""
        return self._fenced_code_block_re.sub(self._fenced_code_block_sub, text)

    # Rules for a code span:
    # - backslash escapes are not interpreted in a code span
    # - to include one backtick or a run of backticks the delimiters must
    #   be a longer run of backticks
    # - cannot start or end a code span with a backtick; pad with a
    #   space and that space will be removed in the emitted HTML
    # See `test/tm-cases/escapes.text` for a number of edge-case
    # examples.
    _code_span_re = re.compile(r'''
        (?<!\\)
        (`+)        # \1 = Opening run of `
        (?!`)       # See Note A test/tm-cases/escapes.text
        (.+?)       # \2 = The code block
        (?<!`)
        \1          # Matching closer
        (?!`)
        ''', re.X | re.S)

    def _code_span_sub(self, match):
        c = match.group(2).strip(" \t")
        c = self._encode_code(c)
        return "<code%s>%s</code>" % (self._html_class_str_from_tag("code"), c)

    def _do_code_spans(self, text):
        #   *   Backtick quotes are used for <code></code> spans.
        #
        #   *   You can use multiple backticks as the delimiters if you want to
        #       include literal backticks in the code span. So, this input:
        #
        #           Just type ``foo `bar` baz`` at the prompt.
        #
        #       Will translate to:
        #
        #           <p>Just type <code>foo `bar` baz</code> at the prompt.</p>
        #
        #       There's no arbitrary limit to the number of backticks you
        #       can use as delimiters. If you need three consecutive backticks
        #       in your code, use four for delimiters, etc.
        #
        #   *   You can use spaces to get literal backticks at the edges:
        #
        #           ... type `` `bar` `` ...
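A standalone copy of the code-span pattern above shows the multiple-backtick delimiter rule in action:

```python
import re

# Same shape as markdown2's _code_span_re, reproduced for illustration.
code_span_re = re.compile(r'''
    (?<!\\)
    (`+)        # \1 = opening run of backticks
    (?!`)
    (.+?)       # \2 = the code content
    (?<!`)
    \1          # closing run must be the same length
    (?!`)
    ''', re.X | re.S)

m = code_span_re.search("Just type ``foo `bar` baz`` at the prompt.")
assert m is not None
# The single backticks inside survive because only a matching ``-run closes:
assert m.group(2) == "foo `bar` baz"
```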
2068 # 2069 # Turns to: 2070 # 2071 # ... type <code>`bar`</code> ... 2072 return self._code_span_re.sub(self._code_span_sub, text) 2073 2074 def _encode_code(self, text): 2075 """Encode/escape certain characters inside Markdown code runs. 2076 The point is that in code, these characters are literals, 2077 and lose their special Markdown meanings. 2078 """ 2079 replacements = [ 2080 # Encode all ampersands; HTML entities are not 2081 # entities within a Markdown code span. 2082 ('&', '&'), 2083 # Do the angle bracket song and dance: 2084 ('<', '<'), 2085 ('>', '>'), 2086 ] 2087 for before, after in replacements: 2088 text = text.replace(before, after) 2089 hashed = _hash_text(text) 2090 self._code_table[text] = hashed 2091 return hashed 2092 2093 _admonitions = r'admonition|attention|caution|danger|error|hint|important|note|tip|warning' 2094 _admonitions_re = re.compile(r''' 2095 ^(\ *)\.\.\ (%s)::\ * # $1 leading indent, $2 the admonition 2096 (.*)? # $3 admonition title 2097 ((?:\s*\n\1\ {3,}.*)+?) 
# $4 admonition body (required) 2098 (?=\s*(?:\Z|\n{4,}|\n\1?\ {0,2}\S)) # until EOF, 3 blank lines or something less indented 2099 ''' % _admonitions, 2100 re.IGNORECASE | re.MULTILINE | re.VERBOSE 2101 ) 2102 2103 def _do_admonitions_sub(self, match): 2104 lead_indent, admonition_name, title, body = match.groups() 2105 2106 admonition_type = '<strong>%s</strong>' % admonition_name 2107 2108 # figure out the class names to assign the block 2109 if admonition_name.lower() == 'admonition': 2110 admonition_class = 'admonition' 2111 else: 2112 admonition_class = 'admonition %s' % admonition_name.lower() 2113 2114 # titles are generally optional 2115 if title: 2116 title = '<em>%s</em>' % title 2117 2118 # process the admonition body like regular markdown 2119 body = self._run_block_gamut("\n%s\n" % self._uniform_outdent(body)[1]) 2120 2121 # indent the body before placing inside the aside block 2122 admonition = self._uniform_indent('%s\n%s\n\n%s\n' % (admonition_type, title, body), self.tab, False) 2123 # wrap it in an aside 2124 admonition = '<aside class="%s">\n%s</aside>' % (admonition_class, admonition) 2125 # now indent the whole admonition back to where it started 2126 return self._uniform_indent(admonition, lead_indent, False) 2127 2128 def _do_admonitions(self, text): 2129 return self._admonitions_re.sub(self._do_admonitions_sub, text) 2130 2131 _strike_re = re.compile(r"~~(?=\S)(.+?)(?<=\S)~~", re.S) 2132 2133 def _do_strike(self, text): 2134 text = self._strike_re.sub(r"<s>\1</s>", text) 2135 return text 2136 2137 _underline_re = re.compile(r"(?<!<!)--(?!>)(?=\S)(.+?)(?<=\S)(?<!<!)--(?!>)", re.S) 2138 2139 def _do_underline(self, text): 2140 text = self._underline_re.sub(r"<u>\1</u>", text) 2141 return text 2142 2143 _strong_re = re.compile(r"(\*\*|__)(?=\S)(.+?[*_]*)(?<=\S)\1", re.S) 2144 _em_re = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.S) 2145 _code_friendly_strong_re = re.compile(r"\*\*(?=\S)(.+?[*_]*)(?<=\S)\*\*", re.S) 2146 _code_friendly_em_re = 
re.compile(r"\*(?=\S)(.+?)(?<=\S)\*", re.S)

    def _do_italics_and_bold(self, text):
        # <strong> must go first:
        if "code-friendly" in self.extras:
            text = self._code_friendly_strong_re.sub(r"<strong>\1</strong>", text)
            text = self._code_friendly_em_re.sub(r"<em>\1</em>", text)
        else:
            text = self._strong_re.sub(r"<strong>\2</strong>", text)
            text = self._em_re.sub(r"<em>\2</em>", text)
        return text

    # "smarty-pants" extra: Very liberal in interpreting a single prime as an
    # apostrophe; e.g. ignores the fact that "round", "bout", "twer", and
    # "twixt" can be written without an initial apostrophe. This is fine because
    # using scare quotes (single quotation marks) is rare.
    _apostrophe_year_re = re.compile(r"'(\d\d)(?=(\s|,|;|\.|\?|!|$))")
    _contractions = ["tis", "twas", "twer", "neath", "o", "n",
                     "round", "bout", "twixt", "nuff", "fraid", "sup"]

    def _do_smart_contractions(self, text):
        text = self._apostrophe_year_re.sub(r"&#8217;\1", text)
        for c in self._contractions:
            text = text.replace("'%s" % c, "&#8217;%s" % c)
            text = text.replace("'%s" % c.capitalize(),
                                "&#8217;%s" % c.capitalize())
        return text

    # Substitute double-quotes before single-quotes.
    _opening_single_quote_re = re.compile(r"(?<!\S)'(?=\S)")
    _opening_double_quote_re = re.compile(r'(?<!\S)"(?=\S)')
    _closing_single_quote_re = re.compile(r"(?<=\S)'")
    _closing_double_quote_re = re.compile(r'(?<=\S)"(?=(\s|,|;|\.|\?|!|$))')

    def _do_smart_punctuation(self, text):
        """Fancifies 'single quotes', "double quotes", and apostrophes.
        Converts --, ---, and ... into en dashes, em dashes, and ellipses.

        Inspiration is: <http://daringfireball.net/projects/smartypants/>
        See "test/tm-cases/smarty_pants.text" for a full discussion of the
        support here and
        <http://code.google.com/p/python-markdown2/issues/detail?id=42> for a
        discussion of some diversion from the original SmartyPants.
        """
        if "'" in text:  # guard for perf
            text = self._do_smart_contractions(text)
            text = self._opening_single_quote_re.sub("&#8216;", text)
            text = self._closing_single_quote_re.sub("&#8217;", text)

        if '"' in text:  # guard for perf
            text = self._opening_double_quote_re.sub("&#8220;", text)
            text = self._closing_double_quote_re.sub("&#8221;", text)

        text = text.replace("---", "&#8212;")
        text = text.replace("--", "&#8211;")
        text = text.replace("...", "&#8230;")
        text = text.replace(" . . . ", "&#8230;")
        text = text.replace(". . .", "&#8230;")

        # TODO: Temporary hack to fix https://github.com/trentm/python-markdown2/issues/150
        if "footnotes" in self.extras and "footnote-ref" in text:
            # Quotes in the footnote back ref get converted to "smart" quotes
            # Change them back here to ensure they work.
            text = text.replace('class="footnote-ref&#8221;', 'class="footnote-ref"')

        return text

    _block_quote_base = r'''
        (                           # Wrap whole match in \1
            (
                ^[ \t]*>%s[ \t]?
                                    # '>' at the start of a line
                .+\n                # rest of the first line
                (.+\n)*             # subsequent consecutive lines
            )+
        )
    '''
    _block_quote_re = re.compile(_block_quote_base % '', re.M | re.X)
    _block_quote_re_spoiler = re.compile(_block_quote_base % '[ \t]*?!?', re.M | re.X)
    _bq_one_level_re = re.compile('^[ \t]*>[ \t]?', re.M)
    _bq_one_level_re_spoiler = re.compile('^[ \t]*>[ \t]*?![ \t]?', re.M)
    _bq_all_lines_spoilers = re.compile(r'\A(?:^[ \t]*>[ \t]*?!.*[\n\r]*)+\Z', re.M)
    _html_pre_block_re = re.compile(r'(\s*<pre>.+?</pre>)', re.S)

    def _dedent_two_spaces_sub(self, match):
        return re.sub(r'(?m)^  ', '', match.group(1))

    def _block_quote_sub(self, match):
        bq = match.group(1)
        is_spoiler = 'spoiler' in self.extras and self._bq_all_lines_spoilers.match(bq)
        # trim one level of quoting
        if is_spoiler:
            bq = self._bq_one_level_re_spoiler.sub('', bq)
        else:
            bq = self._bq_one_level_re.sub('', bq)
        # trim whitespace-only lines
        bq = self._ws_only_line_re.sub('', bq)
        bq = self._run_block_gamut(bq)  # recurse

        bq = re.sub('(?m)^', '  ', bq)
        # These leading spaces screw with <pre> content, so we need to fix that:
        bq = self._html_pre_block_re.sub(self._dedent_two_spaces_sub, bq)

        if is_spoiler:
            return '<blockquote class="spoiler">\n%s\n</blockquote>\n\n' % bq
        else:
            return '<blockquote>\n%s\n</blockquote>\n\n' % bq

    def _do_block_quotes(self, text):
        if '>' not in text:
            return text
        if 'spoiler' in self.extras:
            return self._block_quote_re_spoiler.sub(self._block_quote_sub, text)
        else:
            return self._block_quote_re.sub(self._block_quote_sub, text)

    def _form_paragraphs(self, text):
        # Strip leading and trailing lines:
        text = text.strip('\n')

        # Wrap <p> tags.
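The one-level quote trimming used by `_block_quote_sub` above can be sketched on its own; each pass strips exactly one leading `>` (plus one optional space), so nested quotes survive for the recursive `_run_block_gamut` call:

```python
import re

# Same shape as markdown2's _bq_one_level_re.
bq_one_level_re = re.compile('^[ \t]*>[ \t]?', re.M)

bq = "> line one\n> > nested\n"
trimmed = bq_one_level_re.sub('', bq)
assert trimmed == "line one\n> nested\n"   # one quote level removed
```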
2266 grafs = [] 2267 for i, graf in enumerate(re.split(r"\n{2,}", text)): 2268 if graf in self.html_blocks: 2269 # Unhashify HTML blocks 2270 grafs.append(self.html_blocks[graf]) 2271 else: 2272 cuddled_list = None 2273 if "cuddled-lists" in self.extras: 2274 # Need to put back trailing '\n' for `_list_item_re` 2275 # match at the end of the paragraph. 2276 li = self._list_item_re.search(graf + '\n') 2277 # Two of the same list marker in this paragraph: a likely 2278 # candidate for a list cuddled to preceding paragraph 2279 # text (issue 33). Note the `[-1]` is a quick way to 2280 # consider numeric bullets (e.g. "1." and "2.") to be 2281 # equal. 2282 if (li and len(li.group(2)) <= 3 2283 and ( 2284 (li.group("next_marker") and li.group("marker")[-1] == li.group("next_marker")[-1]) 2285 or 2286 li.group("next_marker") is None 2287 ) 2288 ): 2289 start = li.start() 2290 cuddled_list = self._do_lists(graf[start:]).rstrip("\n") 2291 assert cuddled_list.startswith("<ul>") or cuddled_list.startswith("<ol>") 2292 graf = graf[:start] 2293 2294 # Wrap <p> tags. 2295 graf = self._run_span_gamut(graf) 2296 grafs.append("<p%s>" % self._html_class_str_from_tag('p') + graf.lstrip(" \t") + "</p>") 2297 2298 if cuddled_list: 2299 grafs.append(cuddled_list) 2300 2301 return "\n\n".join(grafs) 2302 2303 def _add_footnotes(self, text): 2304 if self.footnotes: 2305 footer = [ 2306 '<div class="footnotes">', 2307 '<hr' + self.empty_element_suffix, 2308 '<ol>', 2309 ] 2310 2311 if not self.footnote_title: 2312 self.footnote_title = "Jump back to footnote %d in the text." 
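The `try/except TypeError` in `_add_footnotes` exists because a custom `footnote_title` may lack the `%d` placeholder, in which case `%`-formatting with two arguments fails. A small sketch with hypothetical IDs:

```python
# Works when the title contains the "%d" placeholder:
footnote_title = "Jump back to footnote %d in the text."
backlink = ('<a href="#fnref-%s" class="footnoteBackLink" title="'
            + footnote_title + '">&#8617;</a>') % ("note1", 1)
assert 'href="#fnref-note1"' in backlink
assert 'footnote 1 in the text' in backlink

# A title without "%d" leaves an unconverted argument -> TypeError,
# which is why the code falls back to a default backlink.
try:
    _ = '<a href="#fnref-%s" title="no placeholder">x</a>' % ("note1", 1)
    raised = False
except TypeError:
    raised = True
assert raised
```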
            if not self.footnote_return_symbol:
                self.footnote_return_symbol = "&#8617;"

            for i, id in enumerate(self.footnote_ids):
                if i != 0:
                    footer.append('')
                footer.append('<li id="fn-%s">' % id)
                footer.append(self._run_block_gamut(self.footnotes[id]))
                try:
                    backlink = ('<a href="#fnref-%s" ' +
                                'class="footnoteBackLink" ' +
                                'title="' + self.footnote_title + '">' +
                                self.footnote_return_symbol +
                                '</a>') % (id, i + 1)
                except TypeError:
                    log.debug("Footnote error. `footnote_title` "
                              "must include parameter. Using defaults.")
                    backlink = ('<a href="#fnref-%s" '
                                'class="footnoteBackLink" '
                                'title="Jump back to footnote %d in the text.">'
                                '&#8617;</a>' % (id, i + 1))

                if footer[-1].endswith("</p>"):
                    footer[-1] = footer[-1][:-len("</p>")] \
                        + '&#160;' + backlink + "</p>"
                else:
                    footer.append("\n<p>%s</p>" % backlink)
                footer.append('</li>')
            footer.append('</ol>')
            footer.append('</div>')
            return text + '\n\n' + '\n'.join(footer)
        else:
            return text

    _naked_lt_re = re.compile(r'<(?![a-z/?\$!])', re.I)
    _naked_gt_re = re.compile(r'''(?<![a-z0-9?!/'"-])>''', re.I)

    def _encode_amps_and_angles(self, text):
        # Smart processing for ampersands and angle brackets that need
        # to be encoded.
        text = _AMPERSAND_RE.sub('&amp;', text)

        # Encode naked <'s
        text = self._naked_lt_re.sub('&lt;', text)

        # Encode naked >'s
        # Note: Other markdown implementations (e.g. Markdown.pl, PHP
        # Markdown) don't do this.
        text = self._naked_gt_re.sub('&gt;', text)
        return text

    _incomplete_tags_re = re.compile(r"<(/?\w+?(?!\w)\s*?.+?[\s/]+?)")

    def _encode_incomplete_tags(self, text):
        if self.safe_mode not in ("replace", "escape"):
            return text

        if text.endswith(">"):
            return text  # this is not an incomplete tag, this is a link in the form <http://x.y.z>

        return self._incomplete_tags_re.sub("&lt;\\1", text)

    def _encode_backslash_escapes(self, text):
        for ch, escape in list(self._escape_table.items()):
            text = text.replace("\\" + ch, escape)
        return text

    _auto_link_re = re.compile(r'<((https?|ftp):[^\'">\s]+)>', re.I)

    def _auto_link_sub(self, match):
        g1 = match.group(1)
        return '<a href="%s">%s</a>' % (g1, g1)

    _auto_email_link_re = re.compile(r"""
        <
        (?:mailto:)?
        (
            [-.\w]+
            \@
            [-\w]+(\.[-\w]+)*\.[a-z]+
        )
        >
        """, re.I | re.X | re.U)

    def _auto_email_link_sub(self, match):
        return self._encode_email_address(
            self._unescape_special_chars(match.group(1)))

    def _do_auto_links(self, text):
        text = self._auto_link_re.sub(self._auto_link_sub, text)
        text = self._auto_email_link_re.sub(self._auto_email_link_sub, text)
        return text

    def _encode_email_address(self, addr):
        # Input: an email address, e.g. "foo@example.com"
        #
        # Output: the email address as a mailto link, with each character
        # of the address encoded as either a decimal or hex entity, in
        # the hopes of foiling most address harvesting spam bots. E.g.:
        #
        #     <a href="mailto:foo@e
        #         xample.com">foo
        #         @example.com</a>
        #
        # Based on a filter by Matthew Wickline, posted to the BBEdit-Talk
        # mailing list: <http://tinyurl.com/yu7ue>
        chars = [_xml_encode_email_char_at_random(ch)
                 for ch in "mailto:" + addr]
        # Strip the mailto: from the visible part.
        addr = '<a href="%s">%s</a>' \
               % (''.join(chars), ''.join(chars[7:]))
        return addr

    _basic_link_re = re.compile(r'!?\[.*?\]\(.*?\)')

    def _do_link_patterns(self, text):
        link_from_hash = {}
        for regex, repl in self.link_patterns:
            replacements = []
            for match in regex.finditer(text):
                if hasattr(repl, "__call__"):
                    href = repl(match)
                else:
                    href = match.expand(repl)
                replacements.append((match.span(), href))
            for (start, end), href in reversed(replacements):

                # Do not match against links inside brackets.
                if text[start - 1:start] == '[' and text[end:end + 1] == ']':
                    continue

                # Do not match against links in the standard markdown syntax.
                if text[start - 2:start] == '](' or text[end:end + 2] == '")':
                    continue

                # Do not match against links which are escaped.
                if text[start - 3:start] == '"""' and text[end:end + 3] == '"""':
                    text = text[:start - 3] + text[start:end] + text[end + 3:]
                    continue

                # search the text for anything that looks like a link
                is_inside_link = False
                for link_re in (self._auto_link_re, self._basic_link_re):
                    for match in link_re.finditer(text):
                        if any((r[0] <= start and end <= r[1]) for r in match.regs):
                            # if the link pattern start and end pos is within the bounds of
                            # something that looks like a link, then don't process it
                            is_inside_link = True
                            break
                    else:
                        continue
                    break

                if is_inside_link:
                    continue

                escaped_href = (
                    href.replace('"', '&quot;')  # b/c of attr quote
                        # To avoid markdown <em> and <strong>:
                        .replace('*', self._escape_table['*'])
                        .replace('_', self._escape_table['_']))
                link = '<a href="%s">%s</a>' % (escaped_href, text[start:end])
                hash = _hash_text(link)
                link_from_hash[hash] = link
                text = text[:start] + hash + text[end:]
        for hash, link in list(link_from_hash.items()):
            text =
text.replace(hash, link) 2480 return text 2481 2482 def _unescape_special_chars(self, text): 2483 # Swap back in all the special characters we've hidden. 2484 for ch, hash in list(self._escape_table.items()) + list(self._code_table.items()): 2485 text = text.replace(hash, ch) 2486 return text 2487 2488 def _outdent(self, text): 2489 # Remove one level of line-leading tabs or spaces 2490 return self._outdent_re.sub('', text) 2491 2492 def _uniform_outdent(self, text, min_outdent=None): 2493 # Removes the smallest common leading indentation from each line 2494 # of `text` and returns said indent along with the outdented text. 2495 # The `min_outdent` kwarg only outdents lines that start with at 2496 # least this level of indentation or more. 2497 2498 # Find leading indentation of each line 2499 ws = re.findall(r'(^[ \t]*)(?:[^ \t\n])', text, re.MULTILINE) 2500 # Sort the indents within bounds 2501 if min_outdent: 2502 # dont use "is not None" here so we avoid iterating over ws 2503 # if min_outdent == '', which would do nothing 2504 ws = [i for i in ws if len(min_outdent) <= len(i)] 2505 if not ws: 2506 return '', text 2507 # Get smallest common leading indent 2508 ws = sorted(ws)[0] 2509 # Dedent every line by smallest common indent 2510 return ws, ''.join( 2511 (line.replace(ws, '', 1) if line.startswith(ws) else line) 2512 for line in text.splitlines(True) 2513 ) 2514 2515 def _uniform_outdent_limit(self, text, outdent): 2516 # Outdents up to `outdent`. Similar to `_uniform_outdent`, but 2517 # will leave some indentation on the line with the smallest common 2518 # leading indentation depending on the amount specified. 
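A standalone sketch of the smallest-common-indent logic in `_uniform_outdent` above (without the `min_outdent` filtering), showing how the removed indent is returned alongside the outdented text:

```python
import re

def uniform_outdent(text):
    # Find the leading whitespace of every non-blank line.
    ws = re.findall(r'(^[ \t]*)(?:[^ \t\n])', text, re.MULTILINE)
    if not ws:
        return '', text
    indent = sorted(ws)[0]   # smallest common leading indent
    outdented = ''.join(
        line.replace(indent, '', 1) if line.startswith(indent) else line
        for line in text.splitlines(True))
    return indent, outdented

indent, out = uniform_outdent("    a\n        b\n")
assert indent == "    "          # four spaces stripped from every line
assert out == "a\n    b\n"       # relative indentation is preserved
```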
2519 # If the smallest leading indentation is less than `outdent`, it will 2520 # perform identical to `_uniform_outdent` 2521 2522 # Find leading indentation of each line 2523 ws = re.findall(r'(^[ \t]*)(?:[^ \t\n])', text, re.MULTILINE) 2524 if not ws: 2525 return outdent, text 2526 # Get smallest common leading indent 2527 ws = sorted(ws)[0] 2528 if len(outdent) > len(ws): 2529 outdent = ws 2530 return outdent, ''.join( 2531 (line.replace(outdent, '', 1) if line.startswith(outdent) else line) 2532 for line in text.splitlines(True) 2533 ) 2534 2535 def _uniform_indent(self, text, indent, include_empty_lines=False): 2536 return ''.join( 2537 (indent + line if line.strip() or include_empty_lines else '') 2538 for line in text.splitlines(True) 2539 ) 2540 2541 2542class MarkdownWithExtras(Markdown): 2543 """A markdowner class that enables most extras: 2544 2545 - footnotes 2546 - code-color (only has effect if 'pygments' Python module on path) 2547 2548 These are not included: 2549 - pyshell (specific to Python-related documenting) 2550 - code-friendly (because it *disables* part of the syntax) 2551 - link-patterns (because you need to specify some actual 2552 link-patterns anyway) 2553 """ 2554 extras = ["footnotes", "code-color"] 2555 2556 2557# ---- internal support functions 2558 2559 2560def calculate_toc_html(toc): 2561 """Return the HTML for the current TOC. 2562 2563 This expects the `_toc` attribute to have been set on this instance. 
2564 """ 2565 if toc is None: 2566 return None 2567 2568 def indent(): 2569 return ' ' * (len(h_stack) - 1) 2570 2571 lines = [] 2572 h_stack = [0] # stack of header-level numbers 2573 for level, id, name in toc: 2574 if level > h_stack[-1]: 2575 lines.append("%s<ul>" % indent()) 2576 h_stack.append(level) 2577 elif level == h_stack[-1]: 2578 lines[-1] += "</li>" 2579 else: 2580 while level < h_stack[-1]: 2581 h_stack.pop() 2582 if not lines[-1].endswith("</li>"): 2583 lines[-1] += "</li>" 2584 lines.append("%s</ul></li>" % indent()) 2585 lines.append('%s<li><a href="#%s">%s</a>' % ( 2586 indent(), id, name)) 2587 while len(h_stack) > 1: 2588 h_stack.pop() 2589 if not lines[-1].endswith("</li>"): 2590 lines[-1] += "</li>" 2591 lines.append("%s</ul>" % indent()) 2592 return '\n'.join(lines) + '\n' 2593 2594 2595class UnicodeWithAttrs(str): 2596 """A subclass of unicode used for the return value of conversion to 2597 possibly attach some attributes. E.g. the "toc_html" attribute when 2598 the "toc" extra is used. 2599 """ 2600 metadata = None 2601 toc_html = None 2602 2603 2604## {{{ http://code.activestate.com/recipes/577257/ (r1) 2605_slugify_strip_re = re.compile(r'[^\w\s-]') 2606_slugify_hyphenate_re = re.compile(r'[-\s]+') 2607 2608 2609def _slugify(value): 2610 """ 2611 Normalizes string, converts to lowercase, removes non-alpha characters, 2612 and converts spaces to hyphens. 2613 2614 From Django's "django/template/defaultfilters.py". 
2615 """ 2616 import unicodedata 2617 value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode() 2618 value = _slugify_strip_re.sub('', value).strip().lower() 2619 return _slugify_hyphenate_re.sub('-', value) 2620 2621 2622## end of http://code.activestate.com/recipes/577257/ }}} 2623 2624 2625# From http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52549 2626def _curry(*args, **kwargs): 2627 function, args = args[0], args[1:] 2628 2629 def result(*rest, **kwrest): 2630 combined = kwargs.copy() 2631 combined.update(kwrest) 2632 return function(*args + rest, **combined) 2633 2634 return result 2635 2636 2637# Recipe: regex_from_encoded_pattern (1.0) 2638def _regex_from_encoded_pattern(s): 2639 """'foo' -> re.compile(re.escape('foo')) 2640 '/foo/' -> re.compile('foo') 2641 '/foo/i' -> re.compile('foo', re.I) 2642 """ 2643 if s.startswith('/') and s.rfind('/') != 0: 2644 # Parse it: /PATTERN/FLAGS 2645 idx = s.rfind('/') 2646 _, flags_str = s[1:idx], s[idx + 1:] 2647 flag_from_char = { 2648 "i": re.IGNORECASE, 2649 "l": re.LOCALE, 2650 "s": re.DOTALL, 2651 "m": re.MULTILINE, 2652 "u": re.UNICODE, 2653 } 2654 flags = 0 2655 for char in flags_str: 2656 try: 2657 flags |= flag_from_char[char] 2658 except KeyError: 2659 raise ValueError("unsupported regex flag: '%s' in '%s' " 2660 "(must be one of '%s')" 2661 % (char, s, ''.join(list(flag_from_char.keys())))) 2662 return re.compile(s[1:idx], flags) 2663 else: # not an encoded regex 2664 return re.compile(re.escape(s)) 2665 2666 2667# Recipe: dedent (0.1.2) 2668def _dedentlines(lines, tabsize=8, skip_first_line=False): 2669 """_dedentlines(lines, tabsize=8, skip_first_line=False) -> dedented lines 2670 2671 "lines" is a list of lines to dedent. 2672 "tabsize" is the tab width to use for indent width calculations. 2673 "skip_first_line" is a boolean indicating if the first line should 2674 be skipped for calculating the indent width and for dedenting. 
2675 This is sometimes useful for docstrings and similar. 2676 2677 Same as dedent() except operates on a sequence of lines. Note: the 2678 lines list is modified **in-place**. 2679 """ 2680 DEBUG = False 2681 if DEBUG: 2682 print("dedent: dedent(..., tabsize=%d, skip_first_line=%r)" \ 2683 % (tabsize, skip_first_line)) 2684 margin = None 2685 for i, line in enumerate(lines): 2686 if i == 0 and skip_first_line: continue 2687 indent = 0 2688 for ch in line: 2689 if ch == ' ': 2690 indent += 1 2691 elif ch == '\t': 2692 indent += tabsize - (indent % tabsize) 2693 elif ch in '\r\n': 2694 continue # skip all-whitespace lines 2695 else: 2696 break 2697 else: 2698 continue # skip all-whitespace lines 2699 if DEBUG: print("dedent: indent=%d: %r" % (indent, line)) 2700 if margin is None: 2701 margin = indent 2702 else: 2703 margin = min(margin, indent) 2704 if DEBUG: print("dedent: margin=%r" % margin) 2705 2706 if margin is not None and margin > 0: 2707 for i, line in enumerate(lines): 2708 if i == 0 and skip_first_line: continue 2709 removed = 0 2710 for j, ch in enumerate(line): 2711 if ch == ' ': 2712 removed += 1 2713 elif ch == '\t': 2714 removed += tabsize - (removed % tabsize) 2715 elif ch in '\r\n': 2716 if DEBUG: print("dedent: %r: EOL -> strip up to EOL" % line) 2717 lines[i] = lines[i][j:] 2718 break 2719 else: 2720 raise ValueError("unexpected non-whitespace char %r in " 2721 "line %r while removing %d-space margin" 2722 % (ch, line, margin)) 2723 if DEBUG: 2724 print("dedent: %r: %r -> removed %d/%d" \ 2725 % (line, ch, removed, margin)) 2726 if removed == margin: 2727 lines[i] = lines[i][j + 1:] 2728 break 2729 elif removed > margin: 2730 lines[i] = ' ' * (removed - margin) + lines[i][j + 1:] 2731 break 2732 else: 2733 if removed: 2734 lines[i] = lines[i][removed:] 2735 return lines 2736 2737 2738def _dedent(text, tabsize=8, skip_first_line=False): 2739 """_dedent(text, tabsize=8, skip_first_line=False) -> dedented text 2740 2741 "text" is the text to 
dedent. 2742 "tabsize" is the tab width to use for indent width calculations. 2743 "skip_first_line" is a boolean indicating if the first line should 2744 be skipped for calculating the indent width and for dedenting. 2745 This is sometimes useful for docstrings and similar. 2746 2747 textwrap.dedent(s), but don't expand tabs to spaces 2748 """ 2749 lines = text.splitlines(1) 2750 _dedentlines(lines, tabsize=tabsize, skip_first_line=skip_first_line) 2751 return ''.join(lines) 2752 2753 2754class _memoized(object): 2755 """Decorator that caches a function's return value each time it is called. 2756 If called later with the same arguments, the cached value is returned, and 2757 not re-evaluated. 2758 2759 http://wiki.python.org/moin/PythonDecoratorLibrary 2760 """ 2761 2762 def __init__(self, func): 2763 self.func = func 2764 self.cache = {} 2765 2766 def __call__(self, *args): 2767 try: 2768 return self.cache[args] 2769 except KeyError: 2770 self.cache[args] = value = self.func(*args) 2771 return value 2772 except TypeError: 2773 # uncachable -- for instance, passing a list as an argument. 2774 # Better to not cache than to blow up entirely. 2775 return self.func(*args) 2776 2777 def __repr__(self): 2778 """Return the function's docstring.""" 2779 return self.func.__doc__ 2780 2781 2782def _xml_oneliner_re_from_tab_width(tab_width): 2783 """Standalone XML processing instruction regex.""" 2784 return re.compile(r""" 2785 (?: 2786 (?<=\n\n) # Starting after a blank line 2787 | # or 2788 \A\n? 
# the beginning of the doc 2789 ) 2790 ( # save in $1 2791 [ ]{0,%d} 2792 (?: 2793 <\?\w+\b\s+.*?\?> # XML processing instruction 2794 | 2795 <\w+:\w+\b\s+.*?/> # namespaced single tag 2796 ) 2797 [ \t]* 2798 (?=\n{2,}|\Z) # followed by a blank line or end of document 2799 ) 2800 """ % (tab_width - 1), re.X) 2801 2802 2803_xml_oneliner_re_from_tab_width = _memoized(_xml_oneliner_re_from_tab_width) 2804 2805 2806def _hr_tag_re_from_tab_width(tab_width): 2807 return re.compile(r""" 2808 (?: 2809 (?<=\n\n) # Starting after a blank line 2810 | # or 2811 \A\n? # the beginning of the doc 2812 ) 2813 ( # save in \1 2814 [ ]{0,%d} 2815 <(hr) # start tag = \2 2816 \b # word break 2817 ([^<>])*? # 2818 /?> # the matching end tag 2819 [ \t]* 2820 (?=\n{2,}|\Z) # followed by a blank line or end of document 2821 ) 2822 """ % (tab_width - 1), re.X) 2823 2824 2825_hr_tag_re_from_tab_width = _memoized(_hr_tag_re_from_tab_width) 2826 2827 2828def _xml_escape_attr(attr, skip_single_quote=True): 2829 """Escape the given string for use in an HTML/XML tag attribute. 2830 2831 By default this doesn't bother with escaping `'` to `&#39;`, presuming that 2832 the tag attribute is surrounded by double quotes. 2833 """ 2834 escaped = _AMPERSAND_RE.sub('&amp;', attr) 2835 2836 escaped = (attr 2837 .replace('"', '&quot;') 2838 .replace('<', '&lt;') 2839 .replace('>', '&gt;')) 2840 if not skip_single_quote: 2841 escaped = escaped.replace("'", "&#39;") 2842 return escaped 2843 2844 2845def _xml_encode_email_char_at_random(ch): 2846 r = random() 2847 # Roughly 10% raw, 45% hex, 45% dec. 2848 # '@' *must* be encoded. I [John Gruber] insist. 2849 # Issue 26: '_' must be encoded.
2850 if r > 0.9 and ch not in "@_": 2851 return ch 2852 elif r < 0.45: 2853 # The [1:] is to drop leading '0': 0x63 -> x63 2854 return '&#%s;' % hex(ord(ch))[1:] 2855 else: 2856 return '&#%s;' % ord(ch) 2857 2858 2859def _html_escape_url(attr, safe_mode=False): 2860 """Replace special characters that are potentially malicious in url string.""" 2861 escaped = (attr 2862 .replace('"', '&quot;') 2863 .replace('<', '&lt;') 2864 .replace('>', '&gt;')) 2865 if safe_mode: 2866 escaped = escaped.replace('+', ' ') 2867 escaped = escaped.replace("'", "&#39;") 2868 return escaped 2869 2870 2871# ---- mainline 2872 2873class _NoReflowFormatter(optparse.IndentedHelpFormatter): 2874 """An optparse formatter that does NOT reflow the description.""" 2875 2876 def format_description(self, description): 2877 return description or "" 2878 2879 2880def _test(): 2881 import doctest 2882 doctest.testmod() 2883 2884 2885def main(argv=None): 2886 if argv is None: 2887 argv = sys.argv 2888 if not logging.root.handlers: 2889 logging.basicConfig() 2890 2891 usage = "usage: %prog [PATHS...]" 2892 version = "%prog " + __version__ 2893 parser = optparse.OptionParser(prog="markdown2", usage=usage, 2894 version=version, description=cmdln_desc, 2895 formatter=_NoReflowFormatter()) 2896 parser.add_option("-v", "--verbose", dest="log_level", 2897 action="store_const", const=logging.DEBUG, 2898 help="more verbose output") 2899 parser.add_option("--encoding", 2900 help="specify encoding of text content") 2901 parser.add_option("--html4tags", action="store_true", default=False, 2902 help="use HTML 4 style for empty element tags") 2903 parser.add_option("-s", "--safe", metavar="MODE", dest="safe_mode", 2904 help="sanitize literal HTML: 'escape' escapes " 2905 "HTML meta chars, 'replace' replaces with an " 2906 "[HTML_REMOVED] note") 2907 parser.add_option("-x", "--extras", action="append", 2908 help="Turn on specific extra features (not part of " 2909 "the core Markdown spec). 
See above.") 2910 parser.add_option("--use-file-vars", 2911 help="Look for and use Emacs-style 'markdown-extras' " 2912 "file var to turn on extras. See " 2913 "<https://github.com/trentm/python-markdown2/wiki/Extras>") 2914 parser.add_option("--link-patterns-file", 2915 help="path to a link pattern file") 2916 parser.add_option("--self-test", action="store_true", 2917 help="run internal self-tests (some doctests)") 2918 parser.add_option("--compare", action="store_true", 2919 help="run against Markdown.pl as well (for testing)") 2920 parser.set_defaults(log_level=logging.INFO, compare=False, 2921 encoding="utf-8", safe_mode=None, use_file_vars=False) 2922 opts, paths = parser.parse_args() 2923 log.setLevel(opts.log_level) 2924 2925 if opts.self_test: 2926 return _test() 2927 2928 if opts.extras: 2929 extras = {} 2930 for s in opts.extras: 2931 splitter = re.compile("[,;: ]+") 2932 for e in splitter.split(s): 2933 if '=' in e: 2934 ename, earg = e.split('=', 1) 2935 try: 2936 earg = int(earg) 2937 except ValueError: 2938 pass 2939 else: 2940 ename, earg = e, None 2941 extras[ename] = earg 2942 else: 2943 extras = None 2944 2945 if opts.link_patterns_file: 2946 link_patterns = [] 2947 f = open(opts.link_patterns_file) 2948 try: 2949 for i, line in enumerate(f.readlines()): 2950 if not line.strip(): continue 2951 if line.lstrip().startswith("#"): continue 2952 try: 2953 pat, href = line.rstrip().rsplit(None, 1) 2954 except ValueError: 2955 raise MarkdownError("%s:%d: invalid link pattern line: %r" 2956 % (opts.link_patterns_file, i + 1, line)) 2957 link_patterns.append( 2958 (_regex_from_encoded_pattern(pat), href)) 2959 finally: 2960 f.close() 2961 else: 2962 link_patterns = None 2963 2964 from os.path import join, dirname, abspath, exists 2965 markdown_pl = join(dirname(dirname(abspath(__file__))), "test", 2966 "Markdown.pl") 2967 if not paths: 2968 paths = ['-'] 2969 for path in paths: 2970 if path == '-': 2971 text = sys.stdin.read() 2972 else: 2973 fp = 
codecs.open(path, 'r', opts.encoding) 2974 text = fp.read() 2975 fp.close() 2976 if opts.compare: 2977 from subprocess import Popen, PIPE 2978 print("==== Markdown.pl ====") 2979 p = Popen('perl %s' % markdown_pl, shell=True, stdin=PIPE, stdout=PIPE, close_fds=True) 2980 p.stdin.write(text.encode('utf-8')) 2981 p.stdin.close() 2982 perl_html = p.stdout.read().decode('utf-8') 2983 sys.stdout.write(perl_html) 2984 print("==== markdown2.py ====") 2985 html = markdown(text, 2986 html4tags=opts.html4tags, 2987 safe_mode=opts.safe_mode, 2988 extras=extras, link_patterns=link_patterns, 2989 use_file_vars=opts.use_file_vars, 2990 cli=True) 2991 sys.stdout.write(html) 2992 if extras and "toc" in extras: 2993 log.debug("toc_html: " + 2994 str(html.toc_html.encode(sys.stdout.encoding or "utf-8", 'xmlcharrefreplace'))) 2995 if opts.compare: 2996 test_dir = join(dirname(dirname(abspath(__file__))), "test") 2997 if exists(join(test_dir, "test_markdown2.py")): 2998 sys.path.insert(0, test_dir) 2999 from test_markdown2 import norm_html_from_html 3000 norm_html = norm_html_from_html(html) 3001 norm_perl_html = norm_html_from_html(perl_html) 3002 else: 3003 norm_html = html 3004 norm_perl_html = perl_html 3005 print("==== match? %r ====" % (norm_perl_html == norm_html)) 3006 3007 3008if __name__ == "__main__": 3009 sys.exit(main(sys.argv)) class MarkdownError(builtins.Exception): Common base class for all non-exit exceptions.
Inherited Members
- builtins.Exception
- Exception
- builtins.BaseException
- with_traceback
- add_note
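The per-`tab_width` regex factories in the source above (`_xml_oneliner_re_from_tab_width`, `_hr_tag_re_from_tab_width`) are wrapped in the `_memoized` decorator so each pattern is compiled only once per tab width. A minimal self-contained sketch of that caching pattern (the demo regex and the `hr_re_for_tab_width` name are illustrative, not the library's actual pattern):

```python
import re

class _memoized:
    """Cache a function's return value per positional-argument tuple."""
    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        try:
            return self.cache[args]
        except KeyError:
            # First call with these args: compute and remember the result.
            value = self.cache[args] = self.func(*args)
            return value
        except TypeError:
            # Unhashable argument (e.g. a list): call through without caching.
            return self.func(*args)

@_memoized
def hr_re_for_tab_width(tab_width):
    # Allow up to tab_width - 1 leading spaces before an <hr> tag,
    # mirroring the "less than tab" convention used in the source above.
    return re.compile(r"^[ ]{0,%d}<hr\b" % (tab_width - 1))

# Repeated calls with the same tab width reuse the same compiled object.
assert hr_re_for_tab_width(4) is hr_re_for_tab_width(4)
```

Because the decorator keys its cache on the argument tuple, each distinct tab width gets its own compiled pattern while repeated conversions pay the compilation cost only once.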
def markdown_path( path, encoding='utf-8', html4tags=False, tab_width=4, safe_mode=None, extras=None, link_patterns=None, footnote_title=None, footnote_return_symbol=None, use_file_vars=False):152def markdown_path(path, encoding="utf-8", 153 html4tags=False, tab_width=DEFAULT_TAB_WIDTH, 154 safe_mode=None, extras=None, link_patterns=None, 155 footnote_title=None, footnote_return_symbol=None, 156 use_file_vars=False): 157 fp = codecs.open(path, 'r', encoding) 158 text = fp.read() 159 fp.close() 160 return Markdown(html4tags=html4tags, tab_width=tab_width, 161 safe_mode=safe_mode, extras=extras, 162 link_patterns=link_patterns, 163 footnote_title=footnote_title, 164 footnote_return_symbol=footnote_return_symbol, 165 use_file_vars=use_file_vars).convert(text) def markdown( text, html4tags=False, tab_width=4, safe_mode=None, extras=None, link_patterns=None, footnote_title=None, footnote_return_symbol=None, use_file_vars=False, cli=False):168def markdown(text, html4tags=False, tab_width=DEFAULT_TAB_WIDTH, 169 safe_mode=None, extras=None, link_patterns=None, 170 footnote_title=None, footnote_return_symbol=None, 171 use_file_vars=False, cli=False): 172 return Markdown(html4tags=html4tags, tab_width=tab_width, 173 safe_mode=safe_mode, extras=extras, 174 link_patterns=link_patterns, 175 footnote_title=footnote_title, 176 footnote_return_symbol=footnote_return_symbol, 177 use_file_vars=use_file_vars, cli=cli).convert(text) class Markdown:180class Markdown(object): 181 # The dict of "extras" to enable in processing -- a mapping of 182 # extra name to argument for the extra. Most extras do not have an 183 # argument, in which case the value is None. 184 # 185 # This can be set via (a) subclassing and (b) the constructor 186 # "extras" argument. 
187 extras = None 188 189 urls = None 190 titles = None 191 html_blocks = None 192 html_spans = None 193 html_removed_text = "{(#HTML#)}" # placeholder removed text that does not trigger bold 194 html_removed_text_compat = "[HTML_REMOVED]" # for compat with markdown.py 195 196 _toc = None 197 198 # Used to track when we're inside an ordered or unordered list 199 # (see _ProcessListItems() for details): 200 list_level = 0 201 202 _ws_only_line_re = re.compile(r"^[ \t]+$", re.M) 203 204 def __init__(self, html4tags=False, tab_width=4, safe_mode=None, 205 extras=None, link_patterns=None, 206 footnote_title=None, footnote_return_symbol=None, 207 use_file_vars=False, cli=False): 208 if html4tags: 209 self.empty_element_suffix = ">" 210 else: 211 self.empty_element_suffix = " />" 212 self.tab_width = tab_width 213 self.tab = tab_width * " " 214 215 # For compatibility with earlier markdown2.py and with 216 # markdown.py's safe_mode being a boolean, 217 # safe_mode == True -> "replace" 218 if safe_mode is True: 219 self.safe_mode = "replace" 220 else: 221 self.safe_mode = safe_mode 222 223 # Massaging and building the "extras" info. 
224 if self.extras is None: 225 self.extras = {} 226 elif not isinstance(self.extras, dict): 227 self.extras = dict([(e, None) for e in self.extras]) 228 if extras: 229 if not isinstance(extras, dict): 230 extras = dict([(e, None) for e in extras]) 231 self.extras.update(extras) 232 assert isinstance(self.extras, dict) 233 234 if "toc" in self.extras: 235 if "header-ids" not in self.extras: 236 self.extras["header-ids"] = None # "toc" implies "header-ids" 237 238 if self.extras["toc"] is None: 239 self._toc_depth = 6 240 else: 241 self._toc_depth = self.extras["toc"].get("depth", 6) 242 self._instance_extras = self.extras.copy() 243 244 if 'link-patterns' in self.extras: 245 if link_patterns is None: 246 # if you have specified that the link-patterns extra SHOULD 247 # be used (via self.extras) but you haven't provided anything 248 # via the link_patterns argument then an error is raised 249 raise MarkdownError("If the 'link-patterns' extra is used, an argument for 'link_patterns' is required") 250 self.link_patterns = link_patterns 251 self.footnote_title = footnote_title 252 self.footnote_return_symbol = footnote_return_symbol 253 self.use_file_vars = use_file_vars 254 self._outdent_re = re.compile(r'^(\t|[ ]{1,%d})' % tab_width, re.M) 255 self.cli = cli 256 257 self._escape_table = g_escape_table.copy() 258 self._code_table = {} 259 if "smarty-pants" in self.extras: 260 self._escape_table['"'] = _hash_text('"') 261 self._escape_table["'"] = _hash_text("'") 262 263 def reset(self): 264 self.urls = {} 265 self.titles = {} 266 self.html_blocks = {} 267 self.html_spans = {} 268 self.list_level = 0 269 self.extras = self._instance_extras.copy() 270 self._setup_extras() 271 self._toc = None 272 273 def _setup_extras(self): 274 if "footnotes" in self.extras: 275 self.footnotes = {} 276 self.footnote_ids = [] 277 if "header-ids" in self.extras: 278 self._count_from_header_id = defaultdict(int) 279 if "metadata" in self.extras: 280 self.metadata = {} 281 282 # Per 
<https://developer.mozilla.org/en-US/docs/HTML/Element/a> "rel" 283 # should only be used in <a> tags with an "href" attribute. 284 285 # Opens the linked document in a new window or tab 286 # should only be used in <a> tags with an "href" attribute. 287 # same with _a_nofollow 288 _a_nofollow_or_blank_links = re.compile(r""" 289 <(a) 290 ( 291 [^>]* 292 href= # href is required 293 ['"]? # HTML5 attribute values do not have to be quoted 294 [^#'"] # We don't want to match href values that start with # (like footnotes) 295 ) 296 """, 297 re.IGNORECASE | re.VERBOSE 298 ) 299 300 def convert(self, text): 301 """Convert the given text.""" 302 # Main function. The order in which other subs are called here is 303 # essential. Link and image substitutions need to happen before 304 # _EscapeSpecialChars(), so that any *'s or _'s in the <a> 305 # and <img> tags get encoded. 306 307 # Clear the global hashes. If we don't clear these, you get conflicts 308 # from other articles when generating a page which contains more than 309 # one article (e.g. an index page that shows the N most recent 310 # articles): 311 self.reset() 312 313 if not isinstance(text, str): 314 # TODO: perhaps shouldn't presume UTF-8 for string input? 315 text = str(text, 'utf-8') 316 317 if self.use_file_vars: 318 # Look for emacs-style file variable hints. 
319 text = self._emacs_oneliner_vars_pat.sub(self._emacs_vars_oneliner_sub, text) 320 emacs_vars = self._get_emacs_vars(text) 321 if "markdown-extras" in emacs_vars: 322 splitter = re.compile("[ ,]+") 323 for e in splitter.split(emacs_vars["markdown-extras"]): 324 if '=' in e: 325 ename, earg = e.split('=', 1) 326 try: 327 earg = int(earg) 328 except ValueError: 329 pass 330 else: 331 ename, earg = e, None 332 self.extras[ename] = earg 333 334 self._setup_extras() 335 336 # Standardize line endings: 337 text = text.replace("\r\n", "\n") 338 text = text.replace("\r", "\n") 339 340 # Make sure $text ends with a couple of newlines: 341 text += "\n\n" 342 343 # Convert all tabs to spaces. 344 text = self._detab(text) 345 346 # Strip any lines consisting only of spaces and tabs. 347 # This makes subsequent regexen easier to write, because we can 348 # match consecutive blank lines with /\n+/ instead of something 349 # contorted like /[ \t]*\n+/ . 350 text = self._ws_only_line_re.sub("", text) 351 352 # strip metadata from head and extract 353 if "metadata" in self.extras: 354 text = self._extract_metadata(text) 355 356 text = self.preprocess(text) 357 358 if "fenced-code-blocks" in self.extras and not self.safe_mode: 359 text = self._do_fenced_code_blocks(text) 360 361 if self.safe_mode: 362 text = self._hash_html_spans(text) 363 364 # Turn block-level HTML blocks into hash entries 365 text = self._hash_html_blocks(text, raw=True) 366 367 if "fenced-code-blocks" in self.extras and self.safe_mode: 368 text = self._do_fenced_code_blocks(text) 369 370 if 'admonitions' in self.extras: 371 text = self._do_admonitions(text) 372 373 # Because numbering references aren't links (yet?) then we can do everything associated with counters 374 # before we get started 375 if "numbering" in self.extras: 376 text = self._do_numbering(text) 377 378 # Strip link definitions, store in hashes. 
379 if "footnotes" in self.extras: 380 # Must do footnotes first because an unlucky footnote defn 381 # looks like a link defn: 382 # [^4]: this "looks like a link defn" 383 text = self._strip_footnote_definitions(text) 384 text = self._strip_link_definitions(text) 385 386 text = self._run_block_gamut(text) 387 388 if "footnotes" in self.extras: 389 text = self._add_footnotes(text) 390 391 text = self.postprocess(text) 392 393 text = self._unescape_special_chars(text) 394 395 if self.safe_mode: 396 text = self._unhash_html_spans(text) 397 # return the removed text warning to its markdown.py compatible form 398 text = text.replace(self.html_removed_text, self.html_removed_text_compat) 399 400 do_target_blank_links = "target-blank-links" in self.extras 401 do_nofollow_links = "nofollow" in self.extras 402 403 if do_target_blank_links and do_nofollow_links: 404 text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow noopener" target="_blank"\2', text) 405 elif do_target_blank_links: 406 text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="noopener" target="_blank"\2', text) 407 elif do_nofollow_links: 408 text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow"\2', text) 409 410 if "toc" in self.extras and self._toc: 411 self._toc_html = calculate_toc_html(self._toc) 412 413 # Prepend toc html to output 414 if self.cli: 415 text = '{}\n{}'.format(self._toc_html, text) 416 417 text += "\n" 418 419 # Attach attrs to output 420 rv = UnicodeWithAttrs(text) 421 422 if "toc" in self.extras and self._toc: 423 rv.toc_html = self._toc_html 424 425 if "metadata" in self.extras: 426 rv.metadata = self.metadata 427 return rv 428 429 def postprocess(self, text): 430 """A hook for subclasses to do some postprocessing of the html, if 431 desired. This is called before unescaping of special chars and 432 unhashing of raw HTML spans. 
433 """ 434 return text 435 436 def preprocess(self, text): 437 """A hook for subclasses to do some preprocessing of the Markdown, if 438 desired. This is called after basic formatting of the text, but prior 439 to any extras, safe mode, etc. processing. 440 """ 441 return text 442 443 # Is metadata if the content starts with optional '---'-fenced `key: value` 444 # pairs. E.g. (indented for presentation): 445 # --- 446 # foo: bar 447 # another-var: blah blah 448 # --- 449 # # header 450 # or: 451 # foo: bar 452 # another-var: blah blah 453 # 454 # # header 455 _meta_data_pattern = re.compile(r''' 456 ^(?:---[\ \t]*\n)?( # optional opening fence 457 (?: 458 [\S \t]*\w[\S \t]*\s*:(?:\n+[ \t]+.*)+ # indented lists 459 )|(?: 460 (?:[\S \t]*\w[\S \t]*\s*:\s+>(?:\n\s+.*)+?) # multiline long descriptions 461 (?=\n[\S \t]*\w[\S \t]*\s*:\s*.*\n|\s*\Z) # match up until the start of the next key:value definition or the end of the input text 462 )|(?: 463 [\S \t]*\w[\S \t]*\s*:(?! >).*\n? # simple key:value pair, leading spaces allowed 464 ) 465 )(?:---[\ \t]*\n)? 
# optional closing fence 466 ''', re.MULTILINE | re.VERBOSE 467 ) 468 469 _key_val_list_pat = re.compile( 470 r"^-(?:[ \t]*([^\n]*)(?:[ \t]*[:-][ \t]*(\S+))?)(?:\n((?:[ \t]+[^\n]+\n?)+))?", 471 re.MULTILINE, 472 ) 473 _key_val_dict_pat = re.compile( 474 r"^([^:\n]+)[ \t]*:[ \t]*([^\n]*)(?:((?:\n[ \t]+[^\n]+)+))?", re.MULTILINE 475 ) # grp0: key, grp1: value, grp2: multiline value 476 _meta_data_fence_pattern = re.compile(r'^---[\ \t]*\n', re.MULTILINE) 477 _meta_data_newline = re.compile("^\n", re.MULTILINE) 478 479 def _extract_metadata(self, text): 480 if text.startswith("---"): 481 fence_splits = re.split(self._meta_data_fence_pattern, text, maxsplit=2) 482 metadata_content = fence_splits[1] 483 match = re.findall(self._meta_data_pattern, metadata_content) 484 if not match: 485 return text 486 tail = fence_splits[2] 487 else: 488 metadata_split = re.split(self._meta_data_newline, text, maxsplit=1) 489 metadata_content = metadata_split[0] 490 match = re.findall(self._meta_data_pattern, metadata_content) 491 if not match: 492 return text 493 tail = metadata_split[1] 494 495 def parse_structured_value(value): 496 vs = value.lstrip() 497 vs = value.replace(v[: len(value) - len(vs)], "\n")[1:] 498 499 # List 500 if vs.startswith("-"): 501 r = [] 502 for match in re.findall(self._key_val_list_pat, vs): 503 if match[0] and not match[1] and not match[2]: 504 r.append(match[0].strip()) 505 elif match[0] == ">" and not match[1] and match[2]: 506 r.append(match[2].strip()) 507 elif match[0] and match[1]: 508 r.append({match[0].strip(): match[1].strip()}) 509 elif not match[0] and not match[1] and match[2]: 510 r.append(parse_structured_value(match[2])) 511 else: 512 # Broken case 513 pass 514 515 return r 516 517 # Dict 518 else: 519 return { 520 match[0].strip(): ( 521 match[1].strip() 522 if match[1] 523 else parse_structured_value(match[2]) 524 ) 525 for match in re.findall(self._key_val_dict_pat, vs) 526 } 527 528 for item in match: 529 530 k, v = item.split(":", 1) 
531 532 # Multiline value 533 if v[:3] == " >\n": 534 self.metadata[k.strip()] = _dedent(v[3:]).strip() 535 536 # Empty value 537 elif v == "\n": 538 self.metadata[k.strip()] = "" 539 540 # Structured value 541 elif v[0] == "\n": 542 self.metadata[k.strip()] = parse_structured_value(v) 543 544 # Simple value 545 else: 546 self.metadata[k.strip()] = v.strip() 547 548 return tail 549 550 _emacs_oneliner_vars_pat = re.compile(r"((?:<!--)?\s*-\*-)\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?(-\*-\s*(?:-->)?)", 551 re.UNICODE) 552 # This regular expression is intended to match blocks like this: 553 # PREFIX Local Variables: SUFFIX 554 # PREFIX mode: Tcl SUFFIX 555 # PREFIX End: SUFFIX 556 # Some notes: 557 # - "[ \t]" is used instead of "\s" to specifically exclude newlines 558 # - "(\r\n|\n|\r)" is used instead of "$" because the sre engine does 559 # not like anything other than Unix-style line terminators. 560 _emacs_local_vars_pat = re.compile(r"""^ 561 (?P<prefix>(?:[^\r\n|\n|\r])*?) 562 [\ \t]*Local\ Variables:[\ \t]* 563 (?P<suffix>.*?)(?:\r\n|\n|\r) 564 (?P<content>.*?\1End:) 565 """, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE) 566 567 def _emacs_vars_oneliner_sub(self, match): 568 if match.group(1).strip() == '-*-' and match.group(4).strip() == '-*-': 569 lead_ws = re.findall(r'^\s*', match.group(1))[0] 570 tail_ws = re.findall(r'\s*$', match.group(4))[0] 571 return '%s<!-- %s %s %s -->%s' % (lead_ws, '-*-', match.group(2).strip(), '-*-', tail_ws) 572 573 start, end = match.span() 574 return match.string[start: end] 575 576 def _get_emacs_vars(self, text): 577 """Return a dictionary of emacs-style local variables. 
578 579 Parsing is done loosely according to this spec (and according to 580 some in-practice deviations from this): 581 http://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html#Specifying-File-Variables 582 """ 583 emacs_vars = {} 584 SIZE = pow(2, 13) # 8kB 585 586 # Search near the start for a '-*-'-style one-liner of variables. 587 head = text[:SIZE] 588 if "-*-" in head: 589 match = self._emacs_oneliner_vars_pat.search(head) 590 if match: 591 emacs_vars_str = match.group(2) 592 assert '\n' not in emacs_vars_str 593 emacs_var_strs = [s.strip() for s in emacs_vars_str.split(';') 594 if s.strip()] 595 if len(emacs_var_strs) == 1 and ':' not in emacs_var_strs[0]: 596 # While not in the spec, this form is allowed by emacs: 597 # -*- Tcl -*- 598 # where the implied "variable" is "mode". This form 599 # is only allowed if there are no other variables. 600 emacs_vars["mode"] = emacs_var_strs[0].strip() 601 else: 602 for emacs_var_str in emacs_var_strs: 603 try: 604 variable, value = emacs_var_str.strip().split(':', 1) 605 except ValueError: 606 log.debug("emacs variables error: malformed -*- " 607 "line: %r", emacs_var_str) 608 continue 609 # Lowercase the variable name because Emacs allows "Mode" 610 # or "mode" or "MoDe", etc. 611 emacs_vars[variable.lower()] = value.strip() 612 613 tail = text[-SIZE:] 614 if "Local Variables" in tail: 615 match = self._emacs_local_vars_pat.search(tail) 616 if match: 617 prefix = match.group("prefix") 618 suffix = match.group("suffix") 619 lines = match.group("content").splitlines(0) 620 # print "prefix=%r, suffix=%r, content=%r, lines: %s"\ 621 # % (prefix, suffix, match.group("content"), lines) 622 623 # Validate the Local Variables block: proper prefix and suffix 624 # usage. 
625 for i, line in enumerate(lines): 626 if not line.startswith(prefix): 627 log.debug("emacs variables error: line '%s' " 628 "does not use proper prefix '%s'" 629 % (line, prefix)) 630 return {} 631 # Don't validate suffix on last line. Emacs doesn't care, 632 # neither should we. 633 if i != len(lines) - 1 and not line.endswith(suffix): 634 log.debug("emacs variables error: line '%s' " 635 "does not use proper suffix '%s'" 636 % (line, suffix)) 637 return {} 638 639 # Parse out one emacs var per line. 640 continued_for = None 641 for line in lines[:-1]: # no var on the last line ("PREFIX End:") 642 if prefix: line = line[len(prefix):] # strip prefix 643 if suffix: line = line[:-len(suffix)] # strip suffix 644 line = line.strip() 645 if continued_for: 646 variable = continued_for 647 if line.endswith('\\'): 648 line = line[:-1].rstrip() 649 else: 650 continued_for = None 651 emacs_vars[variable] += ' ' + line 652 else: 653 try: 654 variable, value = line.split(':', 1) 655 except ValueError: 656 log.debug("local variables error: missing colon " 657 "in local variables entry: '%s'" % line) 658 continue 659 # Do NOT lowercase the variable name, because Emacs only 660 # allows "mode" (and not "Mode", "MoDe", etc.) in this block. 661 value = value.strip() 662 if value.endswith('\\'): 663 value = value[:-1].rstrip() 664 continued_for = variable 665 else: 666 continued_for = None 667 emacs_vars[variable] = value 668 669 # Unquote values. 670 for var, val in list(emacs_vars.items()): 671 if len(val) > 1 and (val.startswith('"') and val.endswith('"') 672 or val.startswith('"') and val.endswith('"')): 673 emacs_vars[var] = val[1:-1] 674 675 return emacs_vars 676 677 def _detab_line(self, line): 678 r"""Recursively convert tabs to spaces in a single line. 
679 680 Called from _detab().""" 681 if '\t' not in line: 682 return line 683 chunk1, chunk2 = line.split('\t', 1) 684 chunk1 += (' ' * (self.tab_width - len(chunk1) % self.tab_width)) 685 output = chunk1 + chunk2 686 return self._detab_line(output) 687 688 def _detab(self, text): 689 r"""Iterate text line by line and convert tabs to spaces. 690 691 >>> m = Markdown() 692 >>> m._detab("\tfoo") 693 ' foo' 694 >>> m._detab(" \tfoo") 695 ' foo' 696 >>> m._detab("\t foo") 697 ' foo' 698 >>> m._detab(" foo") 699 ' foo' 700 >>> m._detab(" foo\n\tbar\tblam") 701 ' foo\n bar blam' 702 """ 703 if '\t' not in text: 704 return text 705 output = [] 706 for line in text.splitlines(): 707 output.append(self._detab_line(line)) 708 return '\n'.join(output) 709 710 # I broke out the html5 tags here and add them to _block_tags_a and 711 # _block_tags_b. This way html5 tags are easy to keep track of. 712 _html5tags = '|article|aside|header|hgroup|footer|nav|section|figure|figcaption' 713 714 _block_tags_a = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del' 715 _block_tags_a += _html5tags 716 717 _strict_tag_block_re = re.compile(r""" 718 ( # save in \1 719 ^ # start of line (with re.M) 720 <(%s) # start tag = \2 721 \b # word break 722 (.*\n)*? # any number of lines, minimally matching 723 </\2> # the matching end tag 724 [ \t]* # trailing spaces/tabs 725 (?=\n+|\Z) # followed by a newline or end of document 726 ) 727 """ % _block_tags_a, 728 re.X | re.M) 729 730 _block_tags_b = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math' 731 _block_tags_b += _html5tags 732 733 _liberal_tag_block_re = re.compile(r""" 734 ( # save in \1 735 ^ # start of line (with re.M) 736 <(%s) # start tag = \2 737 \b # word break 738 (.*\n)*? 
# any number of lines, minimally matching 739 .*</\2> # the matching end tag 740 [ \t]* # trailing spaces/tabs 741 (?=\n+|\Z) # followed by a newline or end of document 742 ) 743 """ % _block_tags_b, 744 re.X | re.M) 745 746 _html_markdown_attr_re = re.compile( 747 r'''\s+markdown=("1"|'1')''') 748 749 def _hash_html_block_sub(self, match, raw=False): 750 html = match.group(1) 751 if raw and self.safe_mode: 752 html = self._sanitize_html(html) 753 elif 'markdown-in-html' in self.extras and 'markdown=' in html: 754 first_line = html.split('\n', 1)[0] 755 m = self._html_markdown_attr_re.search(first_line) 756 if m: 757 lines = html.split('\n') 758 middle = '\n'.join(lines[1:-1]) 759 last_line = lines[-1] 760 first_line = first_line[:m.start()] + first_line[m.end():] 761 f_key = _hash_text(first_line) 762 self.html_blocks[f_key] = first_line 763 l_key = _hash_text(last_line) 764 self.html_blocks[l_key] = last_line 765 return ''.join(["\n\n", f_key, 766 "\n\n", middle, "\n\n", 767 l_key, "\n\n"]) 768 key = _hash_text(html) 769 self.html_blocks[key] = html 770 return "\n\n" + key + "\n\n" 771 772 def _hash_html_blocks(self, text, raw=False): 773 """Hashify HTML blocks 774 775 We only want to do this for block-level HTML tags, such as headers, 776 lists, and tables. That's because we still want to wrap <p>s around 777 "paragraphs" that are wrapped in non-block-level tags, such as anchors, 778 phrase emphasis, and spans. The list of tags we're looking for is 779 hard-coded. 780 781 @param raw {boolean} indicates if these are raw HTML blocks in 782 the original source. It makes a difference in "safe" mode. 783 """ 784 if '<' not in text: 785 return text 786 787 # Pass `raw` value into our calls to self._hash_html_block_sub. 788 hash_html_block_sub = _curry(self._hash_html_block_sub, raw=raw) 789 790 # First, look for nested blocks, e.g.: 791 # <div> 792 # <div> 793 # tags for inner block must be indented. 
794 # </div> 795 # </div> 796 # 797 # The outermost tags must start at the left margin for this to match, and 798 # the inner nested divs must be indented. 799 # We need to do this before the next, more liberal match, because the next 800 # match will start at the first `<div>` and stop at the first `</div>`. 801 text = self._strict_tag_block_re.sub(hash_html_block_sub, text) 802 803 # Now match more liberally, simply from `\n<tag>` to `</tag>\n` 804 text = self._liberal_tag_block_re.sub(hash_html_block_sub, text) 805 806 # Special case just for <hr />. It was easier to make a special 807 # case than to make the other regex more complicated. 808 if "<hr" in text: 809 _hr_tag_re = _hr_tag_re_from_tab_width(self.tab_width) 810 text = _hr_tag_re.sub(hash_html_block_sub, text) 811 812 # Special case for standalone HTML comments: 813 if "<!--" in text: 814 start = 0 815 while True: 816 # Delimiters for next comment block. 817 try: 818 start_idx = text.index("<!--", start) 819 except ValueError: 820 break 821 try: 822 end_idx = text.index("-->", start_idx) + 3 823 except ValueError: 824 break 825 826 # Start position for next comment block search. 827 start = end_idx 828 829 # Validate whitespace before comment. 830 if start_idx: 831 # - Up to `tab_width - 1` spaces before start_idx. 832 for i in range(self.tab_width - 1): 833 if text[start_idx - 1] != ' ': 834 break 835 start_idx -= 1 836 if start_idx == 0: 837 break 838 # - Must be preceded by 2 newlines or hit the start of 839 # the document. 840 if start_idx == 0: 841 pass 842 elif start_idx == 1 and text[0] == '\n': 843 start_idx = 0 # to match minute detail of Markdown.pl regex 844 elif text[start_idx - 2:start_idx] == '\n\n': 845 pass 846 else: 847 break 848 849 # Validate whitespace after comment. 850 # - Any number of spaces and tabs. 851 while end_idx < len(text): 852 if text[end_idx] not in ' \t': 853 break 854 end_idx += 1 855 # - Must be followed by 2 newlines or hit end of text. 
856 if text[end_idx:end_idx + 2] not in ('', '\n', '\n\n'): 857 continue 858 859 # Escape and hash (must match `_hash_html_block_sub`). 860 html = text[start_idx:end_idx] 861 if raw and self.safe_mode: 862 html = self._sanitize_html(html) 863 key = _hash_text(html) 864 self.html_blocks[key] = html 865 text = text[:start_idx] + "\n\n" + key + "\n\n" + text[end_idx:] 866 867 if "xml" in self.extras: 868 # Treat XML processing instructions and namespaced one-liner 869 # tags as if they were block HTML tags. E.g., if standalone 870 # (i.e. are their own paragraph), the following do not get 871 # wrapped in a <p> tag: 872 # <?foo bar?> 873 # 874 # <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chapter_1.md"/> 875 _xml_oneliner_re = _xml_oneliner_re_from_tab_width(self.tab_width) 876 text = _xml_oneliner_re.sub(hash_html_block_sub, text) 877 878 return text 879 880 def _strip_link_definitions(self, text): 881 # Strips link definitions from text, stores the URLs and titles in 882 # hash references. 883 less_than_tab = self.tab_width - 1 884 885 # Link defs are in the form: 886 # [id]: url "optional title" 887 _link_def_re = re.compile(r""" 888 ^[ ]{0,%d}\[(.+)\]: # id = \1 889 [ \t]* 890 \n? # maybe *one* newline 891 [ \t]* 892 <?(.+?)>? # url = \2 893 [ \t]* 894 (?: 895 \n? # maybe one newline 896 [ \t]* 897 (?<=\s) # lookbehind for whitespace 898 ['"(] 899 ([^\n]*) # title = \3 900 ['")] 901 [ \t]* 902 )? # title is optional 903 (?:\n+|\Z) 904 """ % less_than_tab, re.X | re.M | re.U) 905 return _link_def_re.sub(self._extract_link_def_sub, text) 906 907 def _extract_link_def_sub(self, match): 908 id, url, title = match.groups() 909 key = id.lower() # Link IDs are case-insensitive 910 self.urls[key] = self._encode_amps_and_angles(url) 911 if title: 912 self.titles[key] = title 913 return "" 914 915 def _do_numbering(self, text): 916 ''' We handle the special extension for generic numbering for 917 tables, figures etc. 
        '''
        # First pass to define all the references
        self.regex_defns = re.compile(r'''
            \[\#(\w+)    # the counter. Open square plus hash plus a word \1
            ([^@]*)      # Some optional characters, that aren't an @. \2
            @(\w+)       # the id. Should this be normed? \3
            ([^\]]*)\]   # The rest of the text up to the terminating ] \4
            ''', re.VERBOSE)
        self.regex_subs = re.compile(r"\[@(\w+)\s*\]")  # [@ref_id]
        counters = {}
        references = {}
        replacements = []
        definition_html = '<figcaption class="{}" id="counter-ref-{}">{}{}{}</figcaption>'
        reference_html = '<a class="{}" href="#counter-ref-{}">{}</a>'
        for match in self.regex_defns.finditer(text):
            # We must have four match groups otherwise this isn't a numbering reference
            if len(match.groups()) != 4:
                continue
            counter = match.group(1)
            text_before = match.group(2).strip()
            ref_id = match.group(3)
            text_after = match.group(4)
            number = counters.get(counter, 1)
            references[ref_id] = (number, counter)
            replacements.append((match.start(0),
                                 definition_html.format(counter,
                                                        ref_id,
                                                        text_before,
                                                        number,
                                                        text_after),
                                 match.end(0)))
            counters[counter] = number + 1
        for repl in reversed(replacements):
            text = text[:repl[0]] + repl[1] + text[repl[2]:]

        # Second pass to replace the references with the right
        # value of the counter
        # Fwiw, it's vaguely annoying to have to turn the iterator into
        # a list and then reverse it but I can't think of a better thing to do.
        for match in reversed(list(self.regex_subs.finditer(text))):
            number, counter = references.get(match.group(1), (None, None))
            if number is not None:
                repl = reference_html.format(counter,
                                             match.group(1),
                                             number)
            else:
                repl = reference_html.format(match.group(1),
                                             'countererror',
                                             '?' + match.group(1) + '?')
            if "smarty-pants" in self.extras:
                repl = repl.replace('"', self._escape_table['"'])

            text = text[:match.start()] + repl + text[match.end():]
        return text

    def _extract_footnote_def_sub(self, match):
        id, text = match.groups()
        text = _dedent(text, skip_first_line=not text.startswith('\n')).strip()
        normed_id = re.sub(r'\W', '-', id)
        # Ensure footnote text ends with a couple newlines (for some
        # block gamut matches).
        self.footnotes[normed_id] = text + "\n\n"
        return ""

    def _strip_footnote_definitions(self, text):
        """A footnote definition looks like this:

            [^note-id]: Text of the note.

                May include one or more indented paragraphs.

        Where,
        - The 'note-id' can be pretty much anything, though typically it
          is the number of the footnote.
        - The first paragraph may start on the next line, like so:

            [^note-id]:
                Text of the note.
        """
        less_than_tab = self.tab_width - 1
        footnote_def_re = re.compile(r'''
            ^[ ]{0,%d}\[\^(.+)\]:   # id = \1
            [ \t]*
            (                       # footnote text = \2
              # First line need not start with the spaces.
              (?:\s*.*\n+)
              (?:
                (?:[ ]{%d} | \t)    # Subsequent lines must be indented.
                .*\n+
              )*
            )
            # Lookahead for non-space at line-start, or end of doc.
            (?:(?=^[ ]{0,%d}\S)|\Z)
            ''' % (less_than_tab, self.tab_width, self.tab_width),
            re.X | re.M)
        return footnote_def_re.sub(self._extract_footnote_def_sub, text)

    _hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)

    def _run_block_gamut(self, text):
        # These are all the transformations that form block-level
        # tags like paragraphs, headers, and list items.
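The two-pass scheme in `_do_numbering` above (first assign per-category counters to `[#counter ... @id ...]` definitions, then resolve `[@id]` references) can be sketched standalone. This is a simplified illustration, not markdown2's API: `number_text` and its plain-text output are hypothetical, whereas the real code emits `<figcaption>`/`<a>` HTML.

```python
import re

# Pass 1 matches definitions such as "[#Figure @f1: caption]"; pass 2
# resolves references such as "[@f1]". Counters are kept per category,
# mirroring _do_numbering's `counters` dict.
DEF_RE = re.compile(r'\[\#(\w+)([^@]*)@(\w+)([^\]]*)\]')
REF_RE = re.compile(r'\[@(\w+)\s*\]')

def number_text(text):
    counters = {}
    refs = {}

    def define(m):
        category, _before, ref_id, after = m.groups()
        n = counters.get(category, 1)
        counters[category] = n + 1
        refs[ref_id] = (category, n)
        return '{} {}{}'.format(category, n, after)

    def resolve(m):
        category, n = refs.get(m.group(1), ('?', '?'))
        return '{} {}'.format(category, n)

    # Definitions must be numbered before references are resolved.
    return REF_RE.sub(resolve, DEF_RE.sub(define, text))
```

As in `_do_numbering`, counters are per category, so figures and tables number independently, and an unknown reference degrades to a placeholder rather than raising.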

        if 'admonitions' in self.extras:
            text = self._do_admonitions(text)

        if "fenced-code-blocks" in self.extras:
            text = self._do_fenced_code_blocks(text)

        text = self._do_headers(text)

        # Do Horizontal Rules:
        # On the number of spaces in horizontal rules: The spec is fuzzy: "If
        # you wish, you may use spaces between the hyphens or asterisks."
        # Markdown.pl 1.0.1's hr regexes limit the number of spaces between the
        # hr chars to one or two. We'll reproduce that limit here.
        hr = "\n<hr" + self.empty_element_suffix + "\n"
        text = re.sub(self._hr_re, hr, text)

        text = self._do_lists(text)

        if "pyshell" in self.extras:
            text = self._prepare_pyshell_blocks(text)
        if "wiki-tables" in self.extras:
            text = self._do_wiki_tables(text)
        if "tables" in self.extras:
            text = self._do_tables(text)

        text = self._do_code_blocks(text)

        text = self._do_block_quotes(text)

        # We already ran _HashHTMLBlocks() before, in Markdown(), but that
        # was to escape raw HTML in the original Markdown source. This time,
        # we're escaping the markup we've just created, so that we don't wrap
        # <p> tags around block-level tags.
        text = self._hash_html_blocks(text)

        text = self._form_paragraphs(text)

        return text

    def _pyshell_block_sub(self, match):
        if "fenced-code-blocks" in self.extras:
            dedented = _dedent(match.group(0))
            return self._do_fenced_code_blocks("```pycon\n" + dedented + "```\n")
        lines = match.group(0).splitlines(0)
        _dedentlines(lines)
        indent = ' ' * self.tab_width
        s = ('\n'  # separate from possible cuddled paragraph
             + indent + ('\n' + indent).join(lines)
             + '\n')
        return s

    def _prepare_pyshell_blocks(self, text):
        """Ensure that Python interactive shell sessions are put in
        code blocks -- even if not properly indented.
1075 """ 1076 if ">>>" not in text: 1077 return text 1078 1079 less_than_tab = self.tab_width - 1 1080 _pyshell_block_re = re.compile(r""" 1081 ^([ ]{0,%d})>>>[ ].*\n # first line 1082 ^(\1[^\S\n]*\S.*\n)* # any number of subsequent lines with at least one character 1083 (?=^\1?\n|\Z) # ends with a blank line or end of document 1084 """ % less_than_tab, re.M | re.X) 1085 1086 return _pyshell_block_re.sub(self._pyshell_block_sub, text) 1087 1088 def _table_sub(self, match): 1089 trim_space_re = '^[ \t\n]+|[ \t\n]+$' 1090 trim_bar_re = r'^\||\|$' 1091 split_bar_re = r'^\||(?<![\`\\])\|' 1092 escape_bar_re = r'\\\|' 1093 1094 head, underline, body = match.groups() 1095 1096 # Determine aligns for columns. 1097 cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in 1098 re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", underline)))] 1099 align_from_col_idx = {} 1100 for col_idx, col in enumerate(cols): 1101 if col[0] == ':' and col[-1] == ':': 1102 align_from_col_idx[col_idx] = ' style="text-align:center;"' 1103 elif col[0] == ':': 1104 align_from_col_idx[col_idx] = ' style="text-align:left;"' 1105 elif col[-1] == ':': 1106 align_from_col_idx[col_idx] = ' style="text-align:right;"' 1107 1108 # thead 1109 hlines = ['<table%s>' % self._html_class_str_from_tag('table'), '<thead>', '<tr>'] 1110 cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in 1111 re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", head)))] 1112 for col_idx, col in enumerate(cols): 1113 hlines.append(' <th%s>%s</th>' % ( 1114 align_from_col_idx.get(col_idx, ''), 1115 self._run_span_gamut(col) 1116 )) 1117 hlines.append('</tr>') 1118 hlines.append('</thead>') 1119 1120 # tbody 1121 hlines.append('<tbody>') 1122 for line in body.strip('\n').split('\n'): 1123 hlines.append('<tr>') 1124 cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in 1125 re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", line)))] 1126 for col_idx, 
col in enumerate(cols): 1127 hlines.append(' <td%s>%s</td>' % ( 1128 align_from_col_idx.get(col_idx, ''), 1129 self._run_span_gamut(col) 1130 )) 1131 hlines.append('</tr>') 1132 hlines.append('</tbody>') 1133 hlines.append('</table>') 1134 1135 return '\n'.join(hlines) + '\n' 1136 1137 def _do_tables(self, text): 1138 """Copying PHP-Markdown and GFM table syntax. Some regex borrowed from 1139 https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L2538 1140 """ 1141 less_than_tab = self.tab_width - 1 1142 table_re = re.compile(r''' 1143 (?:(?<=\n\n)|\A\n?) # leading blank line 1144 1145 ^[ ]{0,%d} # allowed whitespace 1146 (.*[|].*) \n # $1: header row (at least one pipe) 1147 1148 ^[ ]{0,%d} # allowed whitespace 1149 ( # $2: underline row 1150 # underline row with leading bar 1151 (?: \|\ *:?-+:?\ * )+ \|? \s? \n 1152 | 1153 # or, underline row without leading bar 1154 (?: \ *:?-+:?\ *\| )+ (?: \ *:?-+:?\ * )? \s? \n 1155 ) 1156 1157 ( # $3: data rows 1158 (?: 1159 ^[ ]{0,%d}(?!\ ) # ensure line begins with 0 to less_than_tab spaces 1160 .*\|.* \n 1161 )+ 1162 ) 1163 ''' % (less_than_tab, less_than_tab, less_than_tab), re.M | re.X) 1164 return table_re.sub(self._table_sub, text) 1165 1166 def _wiki_table_sub(self, match): 1167 ttext = match.group(0).strip() 1168 # print('wiki table: %r' % match.group(0)) 1169 rows = [] 1170 for line in ttext.splitlines(0): 1171 line = line.strip()[2:-2].strip() 1172 row = [c.strip() for c in re.split(r'(?<!\\)\|\|', line)] 1173 rows.append(row) 1174 # from pprint import pprint 1175 # pprint(rows) 1176 hlines = [] 1177 1178 def add_hline(line, indents=0): 1179 hlines.append((self.tab * indents) + line) 1180 1181 def format_cell(text): 1182 return self._run_span_gamut(re.sub(r"^\s*~", "", cell).strip(" ")) 1183 1184 add_hline('<table%s>' % self._html_class_str_from_tag('table')) 1185 # Check if first cell of first row is a header cell. If so, assume the whole row is a header row. 
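The row parsing in `_wiki_table_sub` above trims the outer `||` pair from each line and splits on any `||` that is *not* preceded by a backslash, so a literal pipe pair can be escaped inside a cell. A minimal standalone sketch of that split (the `split_wiki_row` helper name is hypothetical):

```python
import re

def split_wiki_row(line):
    """Split one wiki-table line like '|| a || b ||' into its cells."""
    line = line.strip()[2:-2].strip()            # drop the outer "||" pair
    # "(?<!\\)" keeps an escaped "\||" inside a cell intact.
    return [c.strip() for c in re.split(r'(?<!\\)\|\|', line)]
```

Cell contents still go through `format_cell` afterwards, which strips the `~` header marker and runs the span gamut.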
        if rows and rows[0] and re.match(r"^\s*~", rows[0][0]):
            add_hline('<thead>', 1)
            add_hline('<tr>', 2)
            for cell in rows[0]:
                add_hline("<th>{}</th>".format(format_cell(cell)), 3)
            add_hline('</tr>', 2)
            add_hline('</thead>', 1)
            # Only one header row allowed.
            rows = rows[1:]
        # If no more rows, don't create a tbody.
        if rows:
            add_hline('<tbody>', 1)
            for row in rows:
                add_hline('<tr>', 2)
                for cell in row:
                    add_hline('<td>{}</td>'.format(format_cell(cell)), 3)
                add_hline('</tr>', 2)
            add_hline('</tbody>', 1)
        add_hline('</table>')
        return '\n'.join(hlines) + '\n'

    def _do_wiki_tables(self, text):
        # Optimization.
        if "||" not in text:
            return text

        less_than_tab = self.tab_width - 1
        wiki_table_re = re.compile(r'''
            (?:(?<=\n\n)|\A\n?)            # leading blank line
            ^([ ]{0,%d})\|\|.+?\|\|[ ]*\n  # first line
            (^\1\|\|.+?\|\|\n)*            # any number of subsequent lines
            ''' % less_than_tab, re.M | re.X)
        return wiki_table_re.sub(self._wiki_table_sub, text)

    def _run_span_gamut(self, text):
        # These are all the transformations that occur *within* block-level
        # tags like paragraphs, headers, and list items.

        text = self._do_code_spans(text)

        text = self._escape_special_chars(text)

        # Process anchor and image tags.
        if "link-patterns" in self.extras:
            text = self._do_link_patterns(text)

        text = self._do_links(text)

        # Make links out of things like `<http://example.com/>`
        # Must come after _do_links(), because you can use < and >
        # delimiters in inline links like [this](<url>).
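The HTML tokenization used by `_escape_special_chars` and `_hash_html_spans` relies on a property of `re.split`: when the pattern contains a capturing group, the delimiters are kept in the result, so tokens alternate plain text (even indices) and markup (odd indices). That invariant is what drives the `is_html_markup` flip-flop in those methods. A minimal demonstration with a deliberately simplified tag pattern, not the real `_sorta_html_tokenize_re`:

```python
import re

# One capturing group means re.split keeps each matched tag as a token,
# so the output alternates text, tag, text, tag, ...
TAG_RE = re.compile(r'(</?\w+[^>]*>)')  # simplified; the real pattern is stricter

def tokenize(text):
    return TAG_RE.split(text)

tokens = tokenize('a *b* <em>c</em> d')
# ['a *b* ', '<em>', 'c', '</em>', ' d']
```

With no tags at all, the whole input comes back as a single text token, so the alternation assumption still holds.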
        text = self._do_auto_links(text)

        text = self._encode_amps_and_angles(text)

        if "strike" in self.extras:
            text = self._do_strike(text)

        if "underline" in self.extras:
            text = self._do_underline(text)

        text = self._do_italics_and_bold(text)

        if "smarty-pants" in self.extras:
            text = self._do_smart_punctuation(text)

        # Do hard breaks:
        if "break-on-newline" in self.extras:
            text = re.sub(r" *\n(?!\<(?:\/?(ul|ol|li))\>)", "<br%s\n" % self.empty_element_suffix, text)
        else:
            text = re.sub(r" {2,}\n", " <br%s\n" % self.empty_element_suffix, text)

        return text

    # "Sorta" because auto-links are identified as "tag" tokens.
    _sorta_html_tokenize_re = re.compile(r"""
        (
            # tag
            </?
            (?:\w+)                                     # tag name
            (?:\s+(?:[\w-]+:)?[\w-]+=(?:".*?"|'.*?'))*  # attributes
            \s*/?>
            |
            # auto-link (e.g., <http://www.activestate.com/>)
            <[\w~:/?#\[\]@!$&'\(\)*+,;%=\.\\-]+>
            |
            <!--.*?-->      # comment
            |
            <\?.*?\?>       # processing instruction
        )
        """, re.X)

    def _escape_special_chars(self, text):
        # Python markdown note: the HTML tokenization here differs from
        # that in Markdown.pl, hence the behaviour for subtle cases can
        # differ (I believe the tokenizer here does a better job because
        # it isn't susceptible to unmatched '<' and '>' in HTML tags).
        # Note, however, that '>' is not allowed in an auto-link URL
        # here.
        escaped = []
        is_html_markup = False
        for token in self._sorta_html_tokenize_re.split(text):
            if is_html_markup:
                # Within tags/HTML-comments/auto-links, encode * and _
                # so they don't conflict with their use in Markdown for
                # italics and strong. We're replacing each such
                # character with its corresponding MD5 checksum value;
                # this is likely overkill, but it should prevent us from
                # colliding with the escape values by accident.
                escaped.append(token.replace('*', self._escape_table['*'])
                                    .replace('_', self._escape_table['_']))
            else:
                escaped.append(self._encode_backslash_escapes(token))
            is_html_markup = not is_html_markup
        return ''.join(escaped)

    def _hash_html_spans(self, text):
        # Used for safe_mode.

        def _is_auto_link(s):
            if ':' in s and self._auto_link_re.match(s):
                return True
            elif '@' in s and self._auto_email_link_re.match(s):
                return True
            return False

        def _is_code_span(index, token):
            try:
                if token == '<code>':
                    peek_tokens = split_tokens[index: index + 3]
                elif token == '</code>':
                    peek_tokens = split_tokens[index - 2: index + 1]
                else:
                    return False
            except IndexError:
                return False

            return re.match(r'<code>md5-[A-Fa-f0-9]{32}</code>', ''.join(peek_tokens))

        tokens = []
        split_tokens = self._sorta_html_tokenize_re.split(text)
        is_html_markup = False
        for index, token in enumerate(split_tokens):
            if is_html_markup and not _is_auto_link(token) and not _is_code_span(index, token):
                sanitized = self._sanitize_html(token)
                key = _hash_text(sanitized)
                self.html_spans[key] = sanitized
                tokens.append(key)
            else:
                tokens.append(self._encode_incomplete_tags(token))
            is_html_markup = not is_html_markup
        return ''.join(tokens)

    def _unhash_html_spans(self, text):
        for key, sanitized in list(self.html_spans.items()):
            text = text.replace(key, sanitized)
        return text

    def _sanitize_html(self, s):
        if self.safe_mode == "replace":
            return self.html_removed_text
        elif self.safe_mode == "escape":
            replacements = [
                ('&', '&amp;'),
                ('<', '&lt;'),
                ('>', '&gt;'),
            ]
            for before, after in replacements:
                s = s.replace(before, after)
            return s
        else:
            raise MarkdownError("invalid value for 'safe_mode': %r (must be "
                                "'escape' or 'replace')" % self.safe_mode)

    _inline_link_title = re.compile(r'''
        (                   # \1
          [ \t]+
          (['"])            # quote char = \2
          (?P<title>.*?)
          \2
        )?                  # title is optional
        \)$
        ''', re.X | re.S)
    _tail_of_reference_link_re = re.compile(r'''
        # Match tail of: [text][id]
        [ ]?          # one optional space
        (?:\n[ ]*)?   # one optional newline followed by spaces
        \[
            (?P<id>.*?)
        \]
        ''', re.X | re.S)

    _whitespace = re.compile(r'\s*')

    _strip_anglebrackets = re.compile(r'<(.*)>.*')

    def _find_non_whitespace(self, text, start):
        """Returns the index of the first non-whitespace character in text
        after (and including) start.
        """
        match = self._whitespace.match(text, start)
        return match.end()

    def _find_balanced(self, text, start, open_c, close_c):
        """Returns the index where the open_c and close_c characters balance
        out - the same number of open_c and close_c are encountered - or the
        end of string if it's reached before the balance point is found.
        """
        i = start
        l = len(text)
        count = 1
        while count > 0 and i < l:
            if text[i] == open_c:
                count += 1
            elif text[i] == close_c:
                count -= 1
            i += 1
        return i

    def _extract_url_and_title(self, text, start):
        """Extracts the url and (optional) title from the tail of a link."""
        # text[start] equals the opening parenthesis
        idx = self._find_non_whitespace(text, start + 1)
        if idx == len(text):
            return None, None, None
        end_idx = idx
        has_anglebrackets = text[idx] == "<"
        if has_anglebrackets:
            end_idx = self._find_balanced(text, end_idx + 1, "<", ">")
        end_idx = self._find_balanced(text, end_idx, "(", ")")
        match = self._inline_link_title.search(text, idx, end_idx)
        if not match:
            return None, None, None
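`_find_balanced` above is a plain counting scan: it starts with one unmatched `open_c` already consumed (so `count` begins at 1) and walks forward until the count returns to zero or the string ends. This is how `_extract_url_and_title` finds the closing parenthesis of a link tail even when the URL itself contains nested parentheses. The same technique as a standalone function (hypothetical name, same logic as the method):

```python
def find_balanced(text, start, open_c, close_c):
    """Return the index just past the delimiter that balances the one
    already seen before `start`, or len(text) if it is never closed."""
    i, count = start, 1          # one unmatched open_c precedes `start`
    while count > 0 and i < len(text):
        if text[i] == open_c:
            count += 1
        elif text[i] == close_c:
            count -= 1
        i += 1
    return i
```

For `'(a (b) c) tail'` with `start=1`, the scan returns 9, so `text[:9]` is the balanced span `'(a (b) c)'`; an unclosed delimiter simply yields the end of the string.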