Questions tagged with AWS Data Pipeline
I'm attempting to use AWS Data Pipeline to move a CSV file from my computer to an AWS data lake as a Parquet file. I'm unable to find the exact template to select for migrating from my local computer.
Please help me choose the source.
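For context, here is roughly how I would first get the CSV into S3 with boto3, on the assumption that the pipeline has to read from an AWS source rather than directly from my machine (bucket and key names below are placeholders, not my real ones):
```
import boto3

# Placeholder names -- replace with the real local path, bucket and prefix.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="patients.csv",             # local CSV on my computer
    Bucket="my-data-lake-raw-bucket",    # placeholder bucket
    Key="incoming/patients.csv",         # placeholder key
)
```
Is there then a template that converts that S3 CSV to Parquet, or do I need a custom pipeline definition?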

Hello, the AWS Data Pipeline console shows a message that AWS is planning to remove console access by 04/03/2023, so we are looking into how to work with our data pipelines through AWS CLI commands. Could you let me know which AWS CLI command can be used to check the error stack trace of a pipeline with status FAILED?
I have checked https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-troubleshoot-locate-errors.html, where the steps for checking errors from the AWS console are described.
I have also checked https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-error-logs.html, but from the pipelineLogUri we only receive the activity log, not the error stack trace.
Please help.
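For reference, the closest I have found so far is listing the run objects and reading their fields (the same operations are exposed as the `aws datapipeline query-objects` and `describe-objects` CLI commands). A rough boto3 sketch of what I mean; the exact field key that holds the stack trace is my guess and still needs verifying:
```
import boto3

dp = boto3.client("datapipeline")
PIPELINE_ID = "df-0123456789EXAMPLE"   # placeholder pipeline id

# List the run (instance) objects of the pipeline.
ids = dp.query_objects(pipelineId=PIPELINE_ID, sphere="INSTANCE")["ids"]

# DescribeObjects accepts a limited batch of ids per call, so only the
# first few are inspected here.
for obj in dp.describe_objects(pipelineId=PIPELINE_ID, objectIds=ids[:25])["pipelineObjects"]:
    fields = {f["key"]: f.get("stringValue") for f in obj["fields"]}
    if fields.get("@status") == "FAILED":
        # The error field key ("errorStackTrace" vs "@errorStackTrace") is an
        # assumption -- please confirm which one actually carries the trace.
        print(obj["name"], fields.get("errorStackTrace") or fields.get("@errorStackTrace"))
```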
I am transforming my table by adding new columns using the SQL Query Transform in AWS Glue Studio.

SQL aliases: study
Existing schema from the Data Catalog: study id, patient id, patient age
I want to transform the existing schema by adding new columns.
New column: AccessionNo
Transformed schema: study id, patient id, patient age, AccessionNo
SQL query: **alter table study add columns (AccessionNo int)**
The error it gives:
pyspark.sql.utils.AnalysisException: Invalid command: 'study' is a view not a table.; line 2 pos 0;
'AlterTable V2SessionCatalog(spark_catalog), default.study, 'UnresolvedV2Relation [study], V2SessionCatalog(spark_catalog), default.study, [org.apache.spark.sql.connector.catalog.TableChange$AddColumn@1e7cbfec]
I looked at the official AWS doc for the SQL transform, and it says queries should be in Spark SQL syntax; my query is in Spark SQL syntax.
https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html
What is the exact issue, and could you please help me resolve it?
Thanks
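Is something like the sketch below the intended way to express this, i.e. a plain SELECT that projects the extra column, given that the error says the input is registered as a view rather than a table? The NULL cast is only a placeholder value for AccessionNo, and the underscored column names are stand-ins for mine:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the Glue Studio input node that is registered under the alias "study".
spark.createDataFrame(
    [(1, 101, 34)], ["study_id", "patient_id", "patient_age"]
).createOrReplaceTempView("study")

# A view cannot be ALTERed, but a SELECT can project the new column.
result = spark.sql("SELECT *, CAST(NULL AS INT) AS AccessionNo FROM study")
result.printSchema()
```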
I want to run multiple Python scripts one after another. These are connected; what is the best way to do this? (I want to use an EC2 instance.)
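For example, would a small driver script like the sketch below, running on the instance, be a reasonable approach, or is there a better-suited AWS service for this? The script names are placeholders:
```
import subprocess
import sys

# Placeholder script names -- replace with the real scripts, in dependency order.
SCRIPTS = ["extract.py", "transform.py", "load.py"]

for script in SCRIPTS:
    print(f"Running {script} ...")
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:
        # Stop the chain as soon as one step fails.
        sys.exit(f"{script} failed with exit code {result.returncode}")

print("All scripts completed successfully.")
```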
I am looking for a way to share my Terraform-based modern data stack solution, which I have created using various AWS services, with other companies privately via AWS. However, in the future, we might also consider using AWS CDK for infrastructure creation. Can anyone provide guidance on how I can accomplish this? Specifically, what are the best practices or methods for securely sharing this solution with other organizations while maintaining control over access and usage, and being able to adapt to different infrastructure creation tools in the future?
Where has the option to configure request and response mapping templates gone when creating a resolver for your data source, the way it was done with VTL?
I am attempting to deploy a Flask (python3.8) application via AWS Elastic Beanstalk (EB). I was able to successfully deploy the application in a --single EC2 instance configuration in the public subnet of my VPC. I have a backend data pipeline that generates a `serve.json` file, which contains metadata about the S3 prefix to use for serving up data on the frontend. `serve.json` is updated and overwritten every day, and I have an S3 trigger on the prefix containing `serve.json` that calls a Lambda function to restart the WSGI web-app server managing the Flask app, which runs on the EC2 instances created by EB. After each restart, my Flask app reads in data from .parquet files on the S3 prefix specified in the updated `serve.json` and serves a RESTful app.
**Problem**: When I `restart-app-server` via boto3/Lambda, although the application restarts, my EB environment (`my-eb-environment`) gets degraded (`Health: Red`) with `cause` = `Incorrect application version found on all instances. Expected version n/a.`.
**CLI Route**: The CLI action below works as expected, and its effect is visible in `eb health`:
```
$ aws elasticbeanstalk restart-app-server --environment-name "my-eb-environment"
```

**Lambda Route**: An S3 trigger calls the Lambda function below when `serve.json` is uploaded to the trigger prefix.
Lambda function:
```
from datetime import datetime

import boto3


def lambda_handler(event, context):
    """
    Restart the WSGI app server running on the EC2 instance(s) associated
    with an Elastic Beanstalk environment.
    """
    print(str(datetime.now()))
    EB_ENV_NAME = 'my-eb-environment'
    try:
        eb_client = boto3.client('elasticbeanstalk')
        eb_response = eb_client.restart_app_server(EnvironmentName=EB_ENV_NAME)
        print('SUCCESS! RESTARTED WEB SERVER FOR EnvironmentName {}'.format(EB_ENV_NAME))
        response = 200
        return (eb_response, response)
    except Exception as e:
        print('BAD REQUEST: COULD NOT RESTART WEB SERVER\nMESSAGE: {}'.format(e))
        response = 400
        return (None, response)
```
I can `describe-log-streams` and `get-log-events` to view the logs of the Lambda function, and it is clear that the app has refreshed.

But `eb health` reveals that the environment is now degraded.

Running the CLI command in my terminal again refreshes the application server and makes the environment healthy:
```
$ aws elasticbeanstalk restart-app-server --environment-name "my-eb-environment"
```

1. How do I resolve the application version issue on the **Lambda route** to perform `restart-app-server` with this pipeline, so I can automate app refreshing with each uploaded `serve.json`?
2. Any alternative solutions for automated EB application refreshing based on S3 triggers are also appreciated. I do not wish to reboot the EC2 instances, because I would like to avoid the downtime if I can.
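In case it helps with diagnosis, I am also considering having the Lambda log the environment health and its causes right after the restart; a small sketch (same placeholder environment name as above):
```
import boto3

eb_client = boto3.client("elasticbeanstalk")

# Diagnostic only: capture the health colour and causes immediately after the
# restart, to correlate with the "Incorrect application version" message.
health = eb_client.describe_environment_health(
    EnvironmentName="my-eb-environment",
    AttributeNames=["HealthStatus", "Color", "Causes"],
)
print(health.get("HealthStatus"), health.get("Color"), health.get("Causes"))
```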
Hi all,
I saw this announcement today:
> Please note that Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by 02/28/2023.
But I cannot find it in any official AWS post.
I also wonder whether I can create new data pipelines using CloudFormation.
Thank you for your help.
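If CloudFormation is not the right path, would creating the pipeline through the SDK/CLI be supported instead of the console? A rough boto3 sketch of what I mean; the names and the minimal Default object are assumptions on my part, and a real definition would need activities and data nodes added:
```
import boto3

dp = boto3.client("datapipeline")

# Create an empty pipeline shell (names are placeholders).
pipeline_id = dp.create_pipeline(
    name="my-example-pipeline",
    uniqueId="my-example-pipeline-token",
)["pipelineId"]

# Attach a minimal definition -- only the Default scaffolding is shown here;
# activities, data nodes and schedules would be further pipelineObjects.
response = dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        }
    ],
)
print(response.get("validationErrors"), response.get("validationWarnings"))
# dp.activate_pipeline(pipelineId=pipeline_id) would follow once the definition is complete.
```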
Hello, I need to purge all data from a DynamoDB table except the last 1 year of data.
I do not have a TTL attribute set on the table; what is the best approach?
As far as I know, writing a TTL attribute to every record will cost a lot, even though expiring items via TTL is free.
I do have an attribute called "created_on_date" in the table, though.
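To make the question concrete, this is the kind of thing I am considering (just a sketch: it assumes `created_on_date` is stored as an ISO-8601 string, and the key attribute names are invented):
```
from datetime import datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")    # placeholder table name
cutoff = (datetime.utcnow() - timedelta(days=365)).isoformat()

# Scan for items older than one year and delete them in batches.
scan_kwargs = {"FilterExpression": Attr("created_on_date").lt(cutoff)}
with table.batch_writer() as batch:
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            # Key attribute names are assumptions -- use the table's real key schema.
            batch.delete_item(Key={"pk": item["pk"], "sk": item["sk"]})
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```
My worry is that the full scan plus per-item deletes will consume a lot of read and write capacity, which is exactly the cost concern above.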
Hi All,
We have multiple SQL Server instances installed on Windows in our on-premises environment, and the plan is to migrate them to Amazon RDS. We are planning to use the methods below:
1) Native import/export using S3 and then restore
2) DMS
My question is about the native import/export option.
Questions:
1) We don't have access to those SQL Server database servers, and when we take a backup it is always stored on the local machine where SQL Server is installed.
2) If we have to take backups of all these databases on a remote server (jump server) and then upload them to S3, how can we achieve this?
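For the upload half, I assume something like this boto3 sketch would work from the jump server once the .bak files are there (folder and bucket names are placeholders); my main gap is still how to get the backups onto the jump server in the first place:
```
import pathlib

import boto3

s3 = boto3.client("s3")
backup_dir = pathlib.Path(r"D:\backups")    # placeholder folder on the jump server
bucket = "my-sqlserver-backup-bucket"       # placeholder bucket

for bak in backup_dir.glob("*.bak"):
    # upload_file handles multipart uploads for large backup files automatically.
    s3.upload_file(str(bak), bucket, f"native-restore/{bak.name}")
    print(f"Uploaded {bak.name}")
```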
I'm a beginner at data engineering tasks.
I have a task to create a data lakehouse, and I'm trying to understand how to do it using these tools: DMS, S3, Glue and Hudi.
I already created a simple data lake without much difficulty, but building a data lakehouse is very hard for me, because I couldn't find any simple example of this.
My environment is like this:
a PostgreSQL database, and I need to update the data in the data lake daily, but now it should be an incremental update and no longer a full copy.
Does AWS have an example of how to build this?
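To make my question concrete: is the write step in the Glue job supposed to look roughly like this Hudi upsert sketch? The paths, table name and key/precombine fields are guesses on my part, and the Hudi libraries would have to be available to the job:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incremental changes landed by DMS on S3 (placeholder path).
changes = spark.read.parquet("s3://my-dms-bucket/public/customers/")

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",   # assumed primary key
    "hoodie.datasource.write.precombine.field": "updated_at",   # assumed change timestamp
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the changes into the lakehouse table instead of rewriting a full copy.
(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-lakehouse-bucket/customers/"))
```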