Remove non-ASCII characters from pandas column
I have been trying to work on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:
+-----------------------------------------------------------
| DB_user                                  source   count |
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                     A        10    |
| ?D$ZGU ;@D??_???T(?)                     B        3     |
| ?Q`H??M'?Y??KTK$?Ù‹???ЩJL4??*?_??       C        2     |
+-----------------------------------------------------------
I was using this function, which I had come across while researching the problem on SO.
def filter_func(string):
    for i in range(0, len(string)):
        if (ord(string[i]) < 32 or ord(string[i]) > 126):
            break
    return ''
And then using the apply function:
df['DB_user'] = df.apply(filter_func,axis=1)
I keep getting the error:
'ord() expected a character, but string of length 66 found', u'occurred at index 2'
However, I thought that by looping inside filter_func I was passing a single character to ord, so that the moment it hits a non-ASCII character, that character would be replaced by a space.
Could somebody help me out?
Thanks!
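Your code fails because you are not applying it to each character. df.apply(filter_func, axis=1) passes a whole row to the function, so string[i] is an entire cell value rather than one character, and ord raises because it accepts only a single character. You would need: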
df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
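To see where the original error comes from, here is a minimal sketch in plain Python. The list `row` is a hypothetical stand-in for what `df.apply(..., axis=1)` hands to the function (a whole row), and `to_printable_ascii` is an illustrative name, not something from the question.

```python
def to_printable_ascii(text):
    """Replace every character outside the printable ASCII range 32..126 with a space."""
    return "".join(" " if ord(ch) < 32 or ord(ch) > 126 else ch for ch in text)

# Hypothetical stand-in for one row of the frame: DB_user, source, count.
row = ["?D$ZGU\x07;@D", "B", 3]

try:
    # Indexing the row yields a whole cell value, not a single character.
    ord(row[0])
except TypeError as exc:
    error_message = str(exc)

# Applied to the cell value itself, the per-character version works.
cleaned = to_printable_ascii(row[0])
```

With pandas, the same function would be applied to the column itself, for example `df["DB_user"].apply(to_printable_ascii)`, so that each call receives one string.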
You can also simplify the join using a chained comparison:
''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
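As a quick sanity check, the sketch below compares the two conditions on every code point below 256. The helper names are illustrative only. The chained form needs inclusive bounds to keep exactly the characters 32 through 126; with strict bounds, `~` (code 126) would also be turned into a space.

```python
def keep_with_or(ch):
    # Negation of the first test: ord(ch) < 32 or ord(ch) > 126
    return not (ord(ch) < 32 or ord(ch) > 126)

def keep_with_chain(ch):
    # Chained comparison with inclusive bounds
    return 32 <= ord(ch) <= 126

all_chars = [chr(i) for i in range(256)]

# Characters on which the two tests disagree (inclusive bounds).
mismatches = [ch for ch in all_chars if keep_with_or(ch) != keep_with_chain(ch)]

# Characters on which a strict-bounds chain would disagree with the first test.
strict_mismatches = [ch for ch in all_chars if keep_with_or(ch) != (32 < ord(ch) < 126)]
```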
You could also use string.printable to filter the characters:
from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))
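One difference worth knowing: `string.printable` also contains whitespace control characters such as tab and newline, so the set-based filter keeps them while the `ord` range test replaces them. A small sketch of the two side by side (the function names are illustrative):

```python
from string import printable

PRINTABLE = set(printable)

def filter_printable(text):
    # Keep anything listed in string.printable, replace the rest with a space.
    return "".join(ch if ch in PRINTABLE else " " for ch in text)

def filter_range(text):
    # Keep only code points 32..126, replace the rest with a space.
    return "".join(ch if 32 <= ord(ch) <= 126 else " " for ch in text)

sample = "a\tb\u00d2c"
by_set = filter_printable(sample)   # tab is kept, the accented letter is replaced
by_range = filter_range(sample)     # tab and the accented letter are both replaced
```

Which behaviour is preferable depends on whether tabs and newlines inside DB_user should survive the cleanup.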
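The fastest is to use translate (this is Python 2 code, where string.maketrans builds a table for byte strings):
from string import maketrans
# Characters to replace: control codes 0-31 and bytes 127-255
del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
# Map each of those characters to a space
trans = maketrans(del_chars, " " * len(del_chars))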
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))
Interestingly, that is faster than:
df['DB_user'] = df["DB_user"].str.translate(trans)
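`string.maketrans` exists only in Python 2. On Python 3, `str.translate` accepts any mapping from code points to replacements, so a comparable table can be built by hand. The sketch below is one possible adaptation, not part of the original answer. Because Python 3 strings are Unicode, a fixed table up to 255 would miss characters such as `Щ`, so this version uses a small dict subclass (the hypothetical `AsciiOnlyTable`) whose `__missing__` maps every other code point to a space.

```python
class AsciiOnlyTable(dict):
    """Translation table: printable ASCII maps to itself, everything else to a space."""

    def __missing__(self, codepoint):
        # Called by str.translate for any code point not stored in the dict.
        return " "

# Pre-populate the printable ASCII range (32..126) so those characters pass through.
trans = AsciiOnlyTable({i: chr(i) for i in range(32, 127)})

# Sample value with Latin-1, punctuation, Cyrillic and control characters mixed in.
cleaned = "?Q`H\u00d9\u2039\u0429JL4\x01".translate(trans)
```

With pandas this would be used as `df["DB_user"].apply(lambda s: s.translate(trans))`, in the same way as in the answer. Whether it is still faster than the `.str.translate` accessor on current versions is something to measure rather than assume.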