How to correct the spaces between the columns while reading a text file?(如何在阅读文本文件时纠正列之间的空格?)
问题描述
我想从文本文件中读取数据并将其写入hdf5格式。但不知何故,在数据文件的中间,列之间的空格消失了。small part of the file数据如下:Generated by trjconv : P/L=1/400 t= 0.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
.
.
9994LI aLI 9994 24.60 41.14 5.32
9995LI aLI 9995 88.47 43.02 5.72
9996LI aLI 9996 18.98 40.60 5.56
9997LI aLI 9997 35.63 46.43 5.68
9998LI aLI 9998 33.81 52.15 5.41
9999LI aLI 9999 38.72 57.18 5.32
10000LI aLI10000 29.36 47.12 5.55
10001LI aLI10001 82.55 44.80 5.50
10002LI aLI10002 42.52 51.00 5.19
10003LI aLI10003 28.61 40.21 5.70
10004LI aLI10004 38.16 42.85 5.33
Generated by trjconv : P/L=1/400 t= 1000.00
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
.
.
9994LI aLI 9994 24.60 41.14 5.32
9995LI aLI 9995 88.47 43.02 5.72
9996LI aLI 9996 18.98 40.60 5.56
9997LI aLI 9997 35.63 46.43 5.68
9998LI aLI 9998 33.81 52.15 5.41
9999LI aLI 9999 38.72 57.18 5.32
10000LI aLI10000 29.36 47.12 5.55
10001LI aLI10001 82.55 44.80 5.50
10002LI aLI10002 42.52 51.00 5.19
10003LI aLI10003 28.61 40.21 5.70
10004LI aLI10004 38.16 42.85 5.33
..
..
..
数据是t=1000的帧的集合,具有一百万个帧。如您所见,在框的末尾,第2列和第3列互相接触。我想在读取数据的同时在它们之间创造空间。我遇到的另一个问题是由..生成的重复标题。由于h5文件不支持字符串,我如何将它们读写到hdf5文件中?有没有办法手动添加它们?代码如下:
import h5py
import numpy as np
#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
# Next, create an empty .h5 file with the dtype
with h5py.File('xaa.h5', 'w') as hdf:
ds= hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,))
# Next read line 1 of .gro file
f = open('xaa', 'r')
data = f.readlines()
ds.attrs["Source"]=data[0]
f.close()
# loop to read rows from 2 until end
skip, incr, row0 = 2, 20, 0
read_gro = True
while read_gro:
arr = np.genfromtxt('xaa', skip_header=skip, max_rows=incr, dtype=gro_dt)
rows = arr.shape[0]
if rows == 0:
read_gro = False
else:
if row0+rows > ds.shape[0] :
ds.resize((row0+rows,))
ds[row0:row0+rows] = arr
skip += rows
row0 += rows
我可以跳过第一个标头,但如何处理即将到来的标头?我可以提供行数的标题,如果有人需要。这些列引发值Error
ValueError: Some errors were detected !
Line #7 (got 5 columns instead of 6)
Line #8 (got 5 columns instead of 6)
Line #9 (got 5 columns instead of 6)
推荐答案
答案更新2021-09-09:
根据注释中的请求,我添加了两个使用f.readline()
的新方法。一个使用索引对行进行切片,另一个使用struct
包对字段进行解包。struct
应该更快,但我没有看到测试文件的性能有显著差异(75个时间步长)。
此外,我还修改了代码,使其在文件末尾使用while True:
和Break进行循环。这样就无需输入时间步数。
这是我根据您对上一个问题的回答所遇到的问题而编写的答案。(参考文献:Reading data from gromacs file and write it to the hdf5 file format。此答案使用readlines()
将数据读取到列表中。(这可能是您的大文件的一个问题。如果是,则可以使用readline()
将解决方案修改为逐行阅读。)它使用与字段宽度对齐的索引对每行上的数据进行切片。警告:阅读50e6行可能需要一段时间。注意:HDF5支持字符串(但h5py不支持NumPy Unicode字符串)。
通过使用索引对每行进行切片来获取值:
import h5py
import numpy as np
csv_file = 'xaa.txt' # data from link in question
# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# The values above are used as indices to slice line
# into the following fields in the loop on data[]:
# [:7], [7:15], [15:20], [20:27], [27:34], [34:]
# Open the file for reading and
# create an empty .h5 file with the dtype above
with open(csv_file, 'r') as f,
h5py.File('xaa.h5', 'w') as hdf:
data = f.readlines()
skip = 0
step = 0
while True:
# Read text header line for THIS time step
if skip == len(data):
print("End Of File")
break
else:
header = data[skip]
print(header)
skip += 1
# get number of data rows
no_rows = int(data[skip])
skip += 1
arr = np.empty(shape=(no_rows,), dtype=gro_dt)
for row, line in enumerate(data[skip:skip+no_rows]):
arr[row]['col1'] = line[:c1].strip()
arr[row]['col2'] = line[c1:c2].strip()
arr[row]['col3'] = int(line[c2:c3])
arr[row]['col4'] = float(line[c3:c4])
arr[row]['col5'] = float(line[c4:c5])
arr[row]['col6'] = float(line[c5:])
if arr.shape[0] > 0:
# create a dataset for THIS time step
ds= hdf.create_dataset(f'dataset_{step:04}', data=arr)
#create attributes for this dataset / time step
hdr_tokens = header.split()
ds.attrs['raw_header'] = header
ds.attrs['Generated by'] = hdr_tokens[2]
ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
ds.attrs['Time'] = hdr_tokens[6]
# increment by rows plus footer line that follows
skip += 1 + no_rows
方法二:使用f.readline()
逐行阅读
通过使用索引对每行进行切片来获取值:
import h5py
import numpy as np
csv_file = 'xaa.txt'
#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
## gro_fmt=[0:7], [7:15], [15:20], [20:27], [27:34], [34:41]
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# Open the file for reading and
# create an empty .h5 file with the dtype above
with open(csv_file, 'r') as f,
h5py.File('xaa.h5', 'w') as hdf:
step = 0
while True:
# Read text header line for THIS time step
header = f.readline()
if not header:
print("End Of File")
break
else:
print(header)
# get number of data rows
no_rows = int(f.readline())
arr = np.empty(shape=(no_rows,), dtype=gro_dt)
for row in range(no_rows):
line = f.readline()
arr[row]['col1'] = line[:c1].strip()
arr[row]['col2'] = line[c1:c2].strip()
arr[row]['col3'] = int(line[c2:c3])
arr[row]['col4'] = float(line[c3:c4])
arr[row]['col5'] = float(line[c4:c5])
arr[row]['col6'] = float(line[c5:])
if arr.shape[0] > 0:
# create a dataset for THIS time step
ds= hdf.create_dataset(f'dataset_{step:04}', data=arr)
#create attributes for this dataset / time step
print(header)
hdr_tokens = header.split()
ds.attrs['raw_header'] = header
ds.attrs['Generated by'] = hdr_tokens[2]
ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
ds.attrs['Time'] = hdr_tokens[6]
footer = f.readline()
step += 1
方法3:使用f.readlines()
逐行阅读
使用struct
Package从每行解包:
import struct
import numpy as np
import h5py
csv_file = 'xaa.txt'
fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
with open(csv_file, 'r') as f,
h5py.File('xaa.h5', 'w') as hdf:
step = 0
while True:
header = f.readline()
if not header:
print("End Of File")
break
else:
print(header)
# get number of data rows
no_rows = int(f.readline())
arr = np.empty(shape=(no_rows,), dtype=gro_dt)
for row in range(no_rows):
fields = parse( f.readline().encode('utf-8') )
arr[row]['col1'] = fields[0].strip()
arr[row]['col2'] = fields[1].strip()
arr[row]['col3'] = int(fields[2])
arr[row]['col4'] = float(fields[3])
arr[row]['col5'] = float(fields[4])
arr[row]['col6'] = float(fields[5])
if arr.shape[0] > 0:
# create a dataset for THIS time step
ds= hdf.create_dataset(f'dataset_{step:04}', data=arr)
#create attributes for this dataset / time step
hdr_tokens = header.split()
ds.attrs['raw_header'] = header
ds.attrs['Generated by'] = hdr_tokens[2]
ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
ds.attrs['Time'] = hdr_tokens[6]
footer = f.readline()
step += 1
这篇关于如何在阅读文本文件时纠正列之间的空格?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:如何在阅读文本文件时纠正列之间的空格?
- 获取多索引中某个级别的最后一个元素 2022-09-22
- 如何防止Groupby超越指数? 2022-09-22
- H5py:如何在HDF5组和数据集上使用key()循环 2022-09-21
- 为什么切换屏幕在Kivy中不起作用? 2022-09-21
- 如何将属性添加到作为组存储在HDF5文件中的 pa 2022-09-21
- 使用带有CROSS_VAL_SCORE的自定义估计器失败 2022-09-21
- 如何命名HDF5数据集中的列? 2022-09-21
- 将文件从Azure文件加载到Azure数据库 2022-09-21
- 合并具有多个索引和列值的数据帧 2022-09-22
- 基于多个一级列的子集多索引DataFrame 2022-09-22