TIL xmllint can evaluate XPath expressions and has an HTML parser. No more ugly frankensed expressions to deal with trees. Ahhh, DSLs.
dummy@x60s_GPT ~ % for page in $(hrefs URL | grep -E "$(basename URL)" | sort -u) ;
do
    curl -sL "${page}" |
    xmllint --html --xpath '//*[@id="content"]' - |
    html2text;
done | less
where hrefs is the script below; note the old-school sedism, which will soon be deprecated:
dummy@x60s_GPT ~ % cat $(which hrefs)
#!/usr/bin/env dash
URL="${1}"
curl -sL "${URL}" | sed 's.>.>\n.g' | sed -n '/href/I s@^.*href="\([^"]\+\)".*$@\1@Igp'
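A minimal sketch of what an XPath-based hrefs could look like, under a few assumptions: the function name hrefs_xpath is hypothetical, it reads HTML on stdin instead of fetching a URL itself, and it relies on grep -o (available in GNU and BSD grep, not POSIX). xmllint prints attribute nodes in the form href="...", on one line or several depending on the libxml2 version, so the values still have to be peeled out afterwards. Entities such as &amp; stay escaped, and values containing a double quote are not handled.

```shell
# Hypothetical XPath-based variant of hrefs: HTML on stdin, one href value per line.
hrefs_xpath() {
    # --html uses the forgiving HTML parser; parser warnings go to stderr, so drop them.
    xmllint --html --xpath '//a/@href' - 2>/dev/null |
        # Isolate each href="..." token, whether xmllint emitted them on one line or many.
        grep -o 'href="[^"]*"' |
        # Strip the attribute name and the surrounding quotes.
        sed 's/^href="//; s/"$//'
}
```

With the function defined in the current shell, curl -sL URL | hrefs_xpath should be able to stand in for hrefs URL in the loop above.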
PS: no need to criticize my fault-tolerantless style; I'm still waiting for a whole Lisp user-space, so why bother...