How To Escape The Whole String In Python?

October 03, 2023 Post a Comment

file=r'D:\tdx\vipdoc\szf10\300383.Txt' text=open(file,'r').read() The file can be read, but at first I wrote file as: file='D:\tdx\vipdoc\szf10\300383.Txt' I can't read it as tex

Solution 1:

Because your 'broken' filename doesn't actually contain \ characters, you cannot replace those characters either. You have a ASCII 9 TAB character, not the two separate characters \ and t:

>>>len('\t')
1
>>>'\' in '\t'
False

You'd have to try and 'repair' the broken string; this is not going to be foolproof, but you can create a replacement table to handle the common escape sequences. For filenames, which generally don't handle carriage returns, tabs or formfeed characters anyway, that's perfectly feasible.

Python string literals only support a limited number of one-letter \ escape sequences; see the Python string literal documentation:

\a  ASCII Bell (BEL)     
\b  ASCII Backspace (BS)     
\f  ASCII Formfeed (FF)  
\n  ASCII Linefeed (LF)  
\r  ASCII Carriage Return (CR)   
\t  ASCII Horizontal Tab (TAB)   
\v  ASCII Vertical Tab (VT)

I've omitted the multi-character sequences as these tend to error out when defining the literal. Simply replace these characters with the escaped sequence:

mapping = {'\a': r'\a', '\b': r'\b', '\f': r'\f', '\n': r'\n',
           '\r': r'\r', '\t': r'\t', '\v': r'\v'}

for char, escaped in mapping.items():
    filename = filename.replace(char, escaped)

Alternatively, we could map these characters with the 'string_escape' codec:

>>> '\t'.encode('string_escape')
'\\t'

You cannot apply this to the whole string as that would double up any properly escaped backslash however. Moreover, for many of the escape codes above, it'll use the \xhh escape sequence instead:

>>> '\a'.encode('string_escape')
'\\x07'

so this method isn't that suitable for your needs.

For characters encoded with \xhh, these are much harder to repair. Windows filesystems support Unicode codepoints just fine, for example. If you make the assumption than only ASCII codepoints are used then it becomes easier. You could use a regular expression to replace these with their 'escaped' version:

import re

filename = re.sub(r'[\x80-\xff]', lambda m: m.group().encode('string_escape'), filename)

This changes any byte outside of the ASCII range into a escape sequence:

>>>import re>>>re.sub(r'[\x80-\xff]', lambda m: m.group().encode('string_escape'), '\xc0')
'\\xc0'

With a carefully chosen range of characters, the above could also be applied to all non-printable ASCII characters, and repair most such broken filenames with one expression, provided we first apply the above mapping to replace codes that are not handled correctly by 'string_escape':

defrepair_filename(filename):
    mapping = {'\a': r'\a', '\b': r'\b', '\f': r'\f', '\v': r'\v'}
    for char, escaped in mapping.items():
        filename = filename.replace(char, escaped)
    filename = re.sub(r'[\x00-\x1f\x7f-\xff]', 
                      lambda m: m.group().encode('string_escape'),
                      filename)
    return filename

Demo on your sample input:

>>>defrepair_filename(filename):...    mapping = {'\a': r'\a', '\b': r'\b', '\f': r'\f', '\v': r'\v'}...for char, escaped in mapping.items():...        filename = filename.replace(char, escaped)...    filename = re.sub(r'[\x00-\x1f\x7f-\xff]', ...lambda m: m.group().encode('string_escape'),...                      filename)...return filename...>>>filename = 'D:\tdx\vipdoc\szf10\300383.Txt'>>>repair_filename(filename)
'D:\\tdx\\vipdoc\\szf10\\xc0383.Txt'

This should fixe most such broken filenames for you. It won't repair \x09 for example, because that's replaced by \\t as well.

It also cannot detect octal escape codes, nor repair them. Note that \300 was repaired as \xc0. This would require trial and error runs, trying out all possible combinations, or making assumptions about the input. You could assume \xhh never occurs but \ooo does, for example.

In that case the expression becomes:

Baca Juga

filename = re.sub(r'[\x00-\x1f\x7f-\xff]', lambda m: '\\{:o}'.format(ord(m.group())), filename)

Demo:

>>>defrepair_filename(filename):...    mapping = {'\a': r'\a', '\b': r'\b', '\f': r'\f', '\v': r'\v'}...for char, escaped in mapping.items():...        filename = filename.replace(char, escaped)...    filename = re.sub(r'[\x00-\x1f\x7f-\xff]', ...lambda m: '\\{:o}'.format(ord(m.group())),...                      filename)...return filename...>>>repair_filename(filename)
'D:\\11dx\\vipdoc\\szf10\\300383.Txt'

What works and doesn't depends a great deal on what kind of filenames you were expecting. More can be done if you know that the final part of the filename always ends with a 6 digit number, for example.

Best, however, would be to avoid corrupting the filenames altogether, of course.

Solution 2:

If you use '' instead of r'' you need to manually escape each backslash in your string literal:

filename = 'D:\\tdx\\vipdoc\\szf10\\300383.Txt'

Using r'' is simpler because it disables interpreting \ as an escape character, and thus \ itself doesn't have to be escaped when you just want it there as a literal slash.

Solution 3:

You generally can't because for instance 'D:\tdx', the \t is interpreted as a tab character. You however can try to convert the escaped chars into something that resembles the original string but that's way more work than writing that file name properly in the first place.

Solution 4:

I think, if you originally write it using the unescaped version, you will have some special characters in the filename. It will also be in the directory that the script originally ran in.

\t will come out as a tab character, \v as a vertical tab, \s is ok, and \300 as a high ascii character.

I suggest you run the following command in python:

import shutil
shutil.move('D:\tdx\vipdoc\szf10\300383.Txt',r'D:\tdx\vipdoc\szf10\300383.Txt')

Make sure you run it in the same directory that the script originally ran in. this should place it where you expect, with the file name you expect.

From then on, you can use the correct version.

Solution 5:

You don't need to do anything complicated, Python has built-in tools for dealing with this sort of problem, in particular, os.path.normpath. See this article for the implementation details.

Learn Python Tutorials