Commit Graph

65 Commits

Author SHA1 Message Date
fe6c0c74a7 Merge branch 'bugfix/2-fail-on-unavailable-resource' of snegov/nevernote into master 2019-11-09 14:40:52 +00:00
6f917578aa Fix failing on unavailable page resource 2019-11-09 17:33:58 +03:00
e6db3f9d1b Fix newlines inside div tag 2019-10-22 16:45:40 +03:00
3b6df3417a Fix link tag with missing rel attribute 2019-10-22 16:45:13 +03:00
e843abbc41 Fix python env string 2019-10-22 16:44:27 +03:00
89a8dd90cc Use BS4 for HTML parsing 2019-10-22 16:05:29 +03:00
3198361266 Add --skip-dups option 2019-10-22 14:39:36 +03:00
bdceede4f2 Rework fetching URLs from the file 2019-10-22 12:17:49 +03:00
91cddfab7c Refactor code 2019-10-22 12:17:49 +03:00
44b8a17841 Use requests library 2019-10-22 12:17:49 +03:00
Maks Snegov
56a7032b3e Merge branch 'fix_htmlparser_strict' 2016-03-10 19:21:48 +03:00
Maks Snegov
26e7176222 strict argument in html.parser.HTMLParser is removed since 3.5 2016-03-10 19:15:03 +03:00
Maks Snegov
edd12deb37 Merge branch 'devel' 2016-02-04 09:10:56 +03:00
Maks Snegov
1a6a7b3c9b Merge branch 'b64script' into devel 2014-10-04 11:08:41 -04:00
Maks Snegov
23f648e1ad limit filename length with 128 chars plus extension 2014-10-04 10:59:32 -04:00
Maks Snegov
c1724b5921 use base64 encoding for embedded scripts
can avoid some issues in browsers' renderers (habrahabr pages were broken because of nested </script> in script content).
2014-10-04 03:38:34 +04:00
Maks Snegov
6b3aa602ef add script embedding 2014-10-04 03:24:38 +04:00
Maks Snegov
cf626546e7 use set of content-types for checking 2014-07-23 08:45:12 +04:00
Maks Snegov
fbf52e9544 add script parsing 2014-07-21 00:46:30 +04:00
Maks Snegov
7ce2bfb97f fix urllib.error.HTTPError print 2014-07-20 21:42:13 +04:00
Maks Snegov
41e984e1f0 fix urllib.error.HTTPError calls 2014-07-20 21:40:14 +04:00
Maks Snegov
fb3870e9dd skip http error pages 2014-07-20 17:31:43 +04:00
Maks Snegov
09346f4a70 fix: error with css charsets if no base charset 2014-07-20 17:31:15 +04:00
Maks Snegov
61d3d84a9c remove unused exception 2014-07-20 17:30:48 +04:00
Maks Snegov
b5ddae0ef8 fix css charset error, add urllib.error.httperror 2014-07-20 17:04:56 +04:00
Maks Snegov
964e79f97b add gzip encoding support 2014-07-20 14:03:49 +04:00
Maks Snegov
5c9d04cf3d use file with links as arguments 2014-07-20 13:48:18 +04:00
Maks Snegov
514b39d287 use default charset utf-8 if not set in headers 2014-07-20 13:31:20 +04:00
Maks Snegov
45f30ca9de fix: error with urls without scheme ('//ya.ru/index.html') 2014-07-20 13:30:22 +04:00
Maks Snegov
b58188b7b7 remove import 2014-07-20 13:29:56 +04:00
Maks Snegov
c523d025af add duplicate checking 2014-07-20 13:06:51 +04:00
Maks Snegov
a0fbb414a7 write url in the beginning of the file 2014-07-20 12:17:01 +04:00
Maks Snegov
716c61f6f1 replace http.client with urllib 2014-07-20 08:09:07 +04:00
Maks Snegov
eb2c43f438 ignore UTF-8 errors 2014-06-25 08:38:43 +04:00
Maks Snegov
6a818f4bb4 fix: error with empty GET urls 2014-06-23 00:50:21 +04:00
Maks Snegov
594ff71991 add css embedding 2014-06-22 23:51:18 +04:00
Maks Snegov
754411b6b7 remove unused header from request 2014-06-22 22:57:42 +04:00
Maks Snegov
a7ef8a8b7b separate complete_url function 2014-06-22 22:56:43 +04:00
Maks Snegov
35f755005d fix: do not work with GET arguments 2014-06-22 13:12:35 +04:00
Maks Snegov
fe69eff79b fix increment postfix in filenames 2014-06-22 12:38:05 +04:00
Maks Snegov
5c87f241d1 clean title from multiple whitespaces 2014-06-22 12:24:10 +04:00
Maks Snegov
ae63ca6318 skip connRefusedError pictures 2014-06-22 12:16:10 +04:00
Maks Snegov
36be68d78d fix title with attributes parsing 2014-06-22 11:59:02 +04:00
Maks Snegov
ab03e18ce2 fix relative urls 2014-06-22 11:48:04 +04:00
Maks Snegov
5b91bef896 add infinite redirects blocking 2014-06-22 11:47:21 +04:00
Maks Snegov
11de357865 add image embedding 2014-06-22 11:45:37 +04:00
Maks Snegov
5837451ed7 add url as comment to saved pages 2014-06-21 20:23:25 +04:00
Maks Snegov
e2009e7f08 skip fname duplicates 2014-06-21 20:09:15 +04:00
Maks Snegov
ab9a7e34c1 get title name 2014-06-21 09:58:47 +04:00
Maks Snegov
aead01258d remove never used if condition 2014-06-21 09:43:12 +04:00