I've installed JusText on a Windows 2012 Server machine and it seems to be running fine overall. However, about 30-40% of the HTML files crash because of encoding issues. The error message I get is:
File "c:\python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 20410: character maps to <undefined>
and then the TXT output file is empty for the HTML file that I'm trying to run JusText on.
An example of a page that is causing it to crash: http://www.democracynow.org/2012/7/6/peru_declares_state_of_emergency_as (byte position 20410, the word GONZÁLEZ). I've saved a copy of the file that I'm trying to do JusText on at:
I've tried every possible combination of
--encoding=...
--enc-force
--enc-errors=...
as well as every possible encoding on the files, and it's still crashing on these files. Any suggestions?
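For what it's worth, the failure seems to be reproducible outside JusText. The snippet below is only a minimal sketch, assuming the page is actually UTF-8: in that case the Á in GONZÁLEZ is stored as the two bytes 0xC3 0x81, and 0x81 has no mapping in Python's cp1252 codec, which would match the traceback above. The helper name `decode_html_bytes` is hypothetical and is not part of JusText; it only illustrates decoding the raw bytes with an explicit encoding and error handler instead of relying on the Windows default code page.

```python
# Sketch only: reproduces the cp1252 failure on a UTF-8 encoded "GONZÁLEZ".
# Assumption: the problem page is UTF-8; this has not been verified against the real file.

raw = "GONZ\u00c1LEZ".encode("utf-8")  # bytes as they would sit on disk if the page is UTF-8


def decode_html_bytes(data, encoding="utf-8", errors="replace"):
    """Hypothetical helper (not a JusText function): decode raw bytes explicitly.

    errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    """
    return data.decode(encoding, errors=errors)


# Decoding the same bytes with the Windows default code page fails on byte 0x81.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

text = decode_html_bytes(raw)
```

If this is what is happening, the cp1252 decode is probably applied when the file is opened in text mode, before any of the encoding options take effect, but that is a guess on my part.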
Thanks so much for your help.
Mark Davies, mark_davies (at) byu.edu
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **