Importing Excel files having variable headers
I have an SSIS package that loads an Excel file into a database. I created an Excel Source to map the Excel column names to the database table column names, and it works fine.
In rare cases, we receive an Excel file whose column names contain extra spaces (for example, the column name is "ABC" but we receive "ABC "), which causes a mapping issue and makes the SSIS package fail.
Is there any way to trim the column names without opening the Excel file?
Note: The page name will be dynamic and the column position may change (e.g., column "ABC" may exist in the first row or second row or ...).
First of all, my solution is based on the answers from @DrHouseofSQL and @BHouse, so read @DrHouseofSQL's answer first, then @BHouse's answer, and then continue with this answer.
Problem
Note: Page name will be dynamic and column position may change (e.g., column "ABC" may exist in the first row or second row or ...)
This situation is a little complex and can be solved using the following workaround:
Solution Overview
- Add a Script Task before the Data Flow Task that imports the data
- Use the Script Task to open the Excel file and get the worksheet name and the header row
- Build the query and store it in a variable
- In the Data Flow Task, use the query stored above as the source (note that you have to set the Delay Validation property to True)
Solution Details
- First, create an SSIS variable of type String (i.e. @[User::strQuery])
- Add another variable that contains the Excel file path (i.e. @[User::ExcelFilePath])
- Add a Script Task, and select @[User::strQuery] as a ReadWrite variable and @[User::ExcelFilePath] as a ReadOnly variable (in the Script Task editor)
- Set the Script Language to VB.NET and write the following script in the script editor window:
Note: you have to import System.Data.OleDb.
In the code below, we search the first 15 rows of the sheet to find the header; you can increase this number if the header may appear after row 15. I also assumed that the columns range from A to I.
'Requires "Imports System.Data" and "Imports System.Data.OleDb" at the top of the script file
'Class-level fields of ScriptMain
Private m_strExcelPath As String
Private m_strExcelConnectionString As String
Private m_dtschemaTable As DataTable

Public Sub Main()
    m_strExcelPath = Dts.Variables.Item("ExcelFilePath").Value.ToString
    Dim strSheetname As String = String.Empty
    Dim intFirstRow As Integer = 0
    'BuildConnectionString() is a helper of the script class that is not shown in this answer
    m_strExcelConnectionString = Me.BuildConnectionString()
    Try
        Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)
            If OleDBCon.State <> ConnectionState.Open Then
                OleDBCon.Open()
            End If
            'Get all worksheets
            m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                           New Object() {Nothing, Nothing, Nothing, "TABLE"})
            'Loop over the worksheets to get the first real one (the workbook may contain temporary or deleted sheets)
            For Each schRow As DataRow In m_dtschemaTable.Rows
                strSheetname = schRow("TABLE_NAME").ToString
                If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then
                    Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:I15]", OleDBCon)
                        Dim dtTable As New DataTable("Table1")
                        cmd.CommandType = CommandType.Text
                        Using daGetDataFromSheet As New OleDbDataAdapter(cmd)
                            daGetDataFromSheet.Fill(dtTable)
                            'Iterate only over the rows actually returned, to avoid an index-out-of-range error
                            For intCount As Integer = 0 To dtTable.Rows.Count - 1
                                If Not String.IsNullOrEmpty(dtTable.Rows(intCount)(0).ToString) Then
                                    '+1 because the DataTable is zero-based, +1 because the data starts on the row after the header
                                    '(assumes the connection string does not treat the first row as column names)
                                    intFirstRow = intCount + 2
                                    'The first non-empty row is the header, so stop searching
                                    Exit For
                                End If
                            Next
                        End Using
                        If intFirstRow = 0 Then Throw New Exception("header not found")
                    End Using
                    'When the first correct sheet is found there is no need to check the others
                    Exit For
                End If
            Next
            OleDBCon.Close()
        End Using
    Catch ex As Exception
        Throw New Exception(ex.Message, ex)
    End Try
    Dts.Variables.Item("strQuery").Value = "SELECT * FROM [" & strSheetname & "A" & intFirstRow.ToString & ":I]"
    Dts.TaskResult = ScriptResults.Success
End Sub
- Then add an Excel connection manager and choose the Excel file that you want to import (just select a sample file, to define the metadata the first time only)
- Assign a default value of Select * from [Sheet1$A2:I] to the variable @[User::strQuery]
- In the Data Flow Task, add an Excel Source, choose SQL command from variable, and select @[User::strQuery]
- Go to the Columns tab and name the columns in the same way that @BHouse suggested
(Image taken from @BHouse's answer)
- Set the Data Flow Task's Delay Validation property to True
- Add the other components to the Data Flow Task
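The script calls Me.BuildConnectionString(), whose body is not shown in the answer. Below is a minimal sketch of what such a helper might look like, not the author's actual code. It assumes the Microsoft ACE OLE DB 12.0 provider is installed and that the workbook is an .xlsx or .xls file. HDR=NO is also an assumption, chosen so that row index 0 of the sampled range corresponds to Excel row 1, which is what the row arithmetic in the script relies on.

```vbnet
' Hypothetical helper for the ScriptMain class; the original answer does not show it.
Private Function BuildConnectionString() As String
    Dim strExtension As String = System.IO.Path.GetExtension(m_strExcelPath).ToLowerInvariant()
    ' "Excel 8.0" targets legacy .xls workbooks; "Excel 12.0 Xml" targets .xlsx
    Dim strExcelVersion As String = If(strExtension = ".xls", "Excel 8.0", "Excel 12.0 Xml")
    ' HDR=NO : do not treat the first row as column names, so rows above the header stay visible
    ' IMEX=1 : read columns with mixed data types as text
    Return "Provider=Microsoft.ACE.OLEDB.12.0;" & _
           "Data Source=" & m_strExcelPath & ";" & _
           "Extended Properties=""" & strExcelVersion & ";HDR=NO;IMEX=1"""
End Function
```

If the script reads the file this way, the Excel connection manager used by the Data Flow Task presumably needs "First row has column names" unchecked as well, so that the columns arrive as F1, F2, ... and can be renamed in the Columns tab.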
UPDATE 1:
From the OP's comments: sometimes an Excel file with no data will arrive (i.e., we have only the header row and no data); in that case the entire task fails
Solution:
If your Excel file contains no data (only the header), follow these steps:
- Add an SSIS variable of type Boolean (i.e. @[User::ImportFile])
- Add @[User::ImportFile] to the Script Task ReadWrite variables
- In the Script Task, check whether the file contains data rows
- If yes, set @[User::ImportFile] = True; otherwise set @[User::ImportFile] = False
- Double-click the arrow (precedence constraint) that connects the Script Task to the Data Flow Task
- Set its evaluation operation to Expression and Constraint
- Write the following expression: @[User::ImportFile] == True
Note: The new Script Task code is:
Public Sub Main()
    m_strExcelPath = Dts.Variables.Item("ExcelFilePath").Value.ToString
    Dim strSheetname As String = String.Empty
    Dim intFirstRow As Integer = 0
    'Number of rows returned from the sampled range; declared here so it is still in scope after the Try block
    Dim intRowCount As Integer = 0
    m_strExcelConnectionString = Me.BuildConnectionString()
    Try
        Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)
            If OleDBCon.State <> ConnectionState.Open Then
                OleDBCon.Open()
            End If
            'Get all worksheets
            m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                           New Object() {Nothing, Nothing, Nothing, "TABLE"})
            'Loop over the worksheets to get the first real one (the workbook may contain temporary or deleted sheets)
            For Each schRow As DataRow In m_dtschemaTable.Rows
                strSheetname = schRow("TABLE_NAME").ToString
                If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then
                    Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:I15]", OleDBCon)
                        Dim dtTable As New DataTable("Table1")
                        cmd.CommandType = CommandType.Text
                        Using daGetDataFromSheet As New OleDbDataAdapter(cmd)
                            daGetDataFromSheet.Fill(dtTable)
                            intRowCount = dtTable.Rows.Count
                            'Iterate only over the rows actually returned, to avoid an index-out-of-range error
                            For intCount As Integer = 0 To intRowCount - 1
                                If Not String.IsNullOrEmpty(dtTable.Rows(intCount)(0).ToString) Then
                                    '+1 because the DataTable is zero-based, +1 because the data starts on the row after the header
                                    intFirstRow = intCount + 2
                                    'The first non-empty row is the header, so stop searching
                                    Exit For
                                End If
                            Next
                        End Using
                    End Using
                    'When the first correct sheet is found there is no need to check the others
                    Exit For
                End If
            Next
            OleDBCon.Close()
        End Using
    Catch ex As Exception
        Throw New Exception(ex.Message, ex)
    End Try
    'No header found, or no rows after the header: skip the import
    If intFirstRow = 0 OrElse _
       intFirstRow > intRowCount Then
        Dts.Variables.Item("ImportFile").Value = False
    Else
        Dts.Variables.Item("ImportFile").Value = True
    End If
    Dts.Variables.Item("strQuery").Value = "SELECT * FROM [" & strSheetname & "A" & intFirstRow.ToString & ":I]"
    Dts.TaskResult = ScriptResults.Success
End Sub
UPDATE 2:
From the OP's comments: is there any other workaround to process the data flow without skipping the whole Data Flow Task? One of its tasks logs the file name, the data count and so on, and that is missing here
Solution:
- Just add another Data Flow Task
- Connect this Data Flow Task to the Script Task using another precedence constraint, with the expression @[User::ImportFile] == False (same steps as for the first connector)
- In the Data Flow Task, add a Script Component as a source
- Create the output columns you want to write to the log
- Create a row that contains the information you need to log
- Add the log destination
Alternatively, instead of adding another Data Flow Task, you can add an Execute SQL Task to insert a row into the log table.
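For the Script Component option, the source could be as small as the sketch below. It is only an illustration under stated assumptions: the component has a single output left at its default name "Output 0" (which is why the generated buffer is called Output0Buffer), the two output columns FileName (string) and DataRowCount (four-byte integer) are hypothetical names to be replaced by the columns of your own log table, and User::ExcelFilePath is listed in the component's ReadOnlyVariables.

```vbnet
' Script Component (Source) for the logging data flow.
' FileName and DataRowCount are hypothetical output columns defined on "Output 0".
Public Overrides Sub CreateNewOutputRows()
    ' Emit a single log row for the file that contained a header but no data
    Output0Buffer.AddRow()
    Output0Buffer.FileName = System.IO.Path.GetFileName(Me.Variables.ExcelFilePath)
    Output0Buffer.DataRowCount = 0
End Sub
```

Because this branch only runs when @[User::ImportFile] is False, the row count can be hard-coded to zero here; the regular import branch keeps its own logging.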