I have many text files that contain data like the following:
350.0 2.1021 0.0000 1.4769 0.0000
357.0 2.0970 0.0000 1.4758 0.0000
364.0 2.0920 0.0000 1.4747 0.0000
371.0 2.0874 0.0000 1.4737 0.0000
I need to give each column a specific name (e.g. a, b, c, d, e):
a b c d e
350.0 2.1021 0.0000 1.4769 0.0000
357.0 2.0970 0.0000 1.4758 0.0000
364.0 2.0920 0.0000 1.4747 0.0000
371.0 2.0874 0.0000 1.4737 0.0000
After that I will split the columns and use them separately.
I wrote this code:
import glob
import pandas as pd

input_files = glob.glob('input/*.txt')
for file_name in input_files:
    data = pd.read_csv(file_name)
    columns_list = ["a", "b", "c", "d", "e"]
    data_list = pd.DataFrame(data, columns=columns_list)
    print(data_list)
The result is:
a b c d e
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
Could you please help me?
You can define the column names while reading the file. Since your data is whitespace-separated rather than comma-separated, also pass a separator:
data = pd.read_csv(file_name, sep=r'\s+', names=columns_list)
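A minimal sketch of the whole loop under that assumption (whitespace-delimited .txt files in input/; the column accesses at the end are just an illustration of the later splitting step):
import glob
import pandas as pd

columns_list = ["a", "b", "c", "d", "e"]

for file_name in glob.glob('input/*.txt'):
    # read the whitespace-separated values and name the columns on the way in
    data = pd.read_csv(file_name, sep=r'\s+', names=columns_list)
    print(data)

    # the columns can now be used separately
    a_values = data["a"]
    b_values = data["b"]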
I have this very simple problem which I can't figure out in Python. I have three columns in a dataset. The first is made of integers (from 0 to 19), the second is made of dates in Y/M/D format, and the third is made of numbers ranging from negative to positive values (mostly 0s, but 200 negative and positive values in total overall).
My dataset looks like this:
Groups date values
0 2020-02-22 0.0000
2020-02-23 0.0000
2020-02-26 0.0000
2020-03-28 0.0000
2020-04-13 1.3433
2020-04-14 0.0000
2020-04-15 0.0000
2020-04-16 0.0000
2020-04-17 -1.3933
2020-04-28 0.0000
2020-05-31 0.0000
2020-06-15 0.0000
2020-08-02 0.0000
1 2020-02-21 0.0000
2020-02-22 0.0000
2020-02-23 0.0000
2020-02-24 0.0000
2020-02-25 0.0000
2020-04-29 0.0000
2020-06-01 0.4404
2020-06-02 0.4404
2020-06-07 0.0000
2 2020-02-22 0.0000
2020-02-23 0.0000
2020-02-24 0.0000
2020-02-28 0.0000
2020-03-01 0.0000
2020-03-07 0.0000
2020-03-08 0.0000
2020-03-14 0.0000
I want to plot curves grouped by column Groups, with the dates on the x axis and the third column ("values") on the y axis. In other words, I want a curve for each of the 20 possible groups (0 to 19) which goes up/down depending on the values of the third column, "value" (the 0s, positive, and negative numbers), all the while keeping the dates on the x axis.
I know how to do this very easily with ggplot in R, but this project is all Python-based and for some reason I just can't figure out how to do it there.
Thanks for the help.
It looks like Groups and date are the two levels of your dataframe's index, in which case you can do:
df['values'].unstack('Groups').plot()
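A self-contained sketch of that approach, assuming the data starts out flat with columns Groups, date and values (the file name here is hypothetical):
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical flat input with columns Groups, date, values
df = pd.read_csv('data.csv', parse_dates=['date'])

# move Groups and date into the index, then pivot Groups into columns:
# one line per group, with the dates on the x axis
df.set_index(['Groups', 'date'])['values'].unstack('Groups').plot()

plt.xlabel('date')
plt.ylabel('values')
plt.show()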
I am concatenating my two dataframes base_df and base_df1, where base_df has the product ID and base_df1 has the sales, profit and discount.
base_df1
sales profit discount
0 0.050090 0.000000 0.262335
1 0.110793 0.000000 0.260662
2 0.309561 0.864121 0.241432
3 0.039217 0.591474 0.260687
4 0.070205 0.000000 0.263628
base_df['Product ID']
0 FUR-ADV-10000002
1 FUR-ADV-10000108
2 FUR-ADV-10000183
3 FUR-ADV-10000188
4 FUR-ADV-10000190
final_df=pd.concat([base_df1,base_df], axis=0, ignore_index=True,sort=False)
But my final_df.head() has NaN values in the Product ID column. What might be the issue?
sales profit discount Product ID
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
Try using axis=1:
final_df=pd.concat([base_df1,base_df], axis=1, sort=False)
Output:
sales profit discount ProductID
0 0.050090 0.000000 0.262335 FUR-ADV-10000002
1 0.110793 0.000000 0.260662 FUR-ADV-10000108
2 0.309561 0.864121 0.241432 FUR-ADV-10000183
3 0.039217 0.591474 0.260687 FUR-ADV-10000188
4 0.070205 0.000000 0.263628 FUR-ADV-10000190
With axis=0 you are stacking the dataframes vertically, and since pandas uses intrinsic data alignment (it aligns the data on the indexes), you are generating the following dataframe:
final_df=pd.concat([base_df1,base_df], axis=0, sort=False)
sales profit discount ProductID
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
0 NaN NaN NaN FUR-ADV-10000002
1 NaN NaN NaN FUR-ADV-10000108
2 NaN NaN NaN FUR-ADV-10000183
3 NaN NaN NaN FUR-ADV-10000188
4 NaN NaN NaN FUR-ADV-10000190
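A minimal sketch reproducing both behaviours with made-up values:
import pandas as pd

base_df1 = pd.DataFrame({'sales': [0.05, 0.11],
                         'profit': [0.00, 0.00],
                         'discount': [0.26, 0.26]})
base_df = pd.DataFrame({'Product ID': ['FUR-ADV-10000002', 'FUR-ADV-10000108']})

# axis=0 stacks the rows; the column sets do not overlap, so NaNs appear
print(pd.concat([base_df1, base_df], axis=0, sort=False))

# axis=1 aligns on the row index and puts the columns side by side
print(pd.concat([base_df1, base_df], axis=1, sort=False))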
I use the following code to build and prepare my pandas dataframe:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv',
                   parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()

# replace string nan with numpy data type
data_extracted = data_extracted.fillna(value=np.nan)

# remove footer of csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors="coerce")

# resample to one week rhythm
data_resampled = data_extracted.resample('W-MON', label='left',
                                         loffset=pd.DateOffset(days=1)).sum()

# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']

# fill possible NaNs with 1 (not 0, because of division by zero when doing pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = data_extracted.astype(float).pct_change(axis=0).replace(
    [np.inf, -np.inf], np.nan).fillna(0)

# actual dropping logic if column has no values at all
data_pct_change.drop([col for col, val in data_pct_change.sum().iteritems()
                      if val == 0], axis=1, inplace=True)

normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()

kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")

for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()
Unfortunately there are a lot of 0's in my dataframe (the articles don't start at the same date, so if A starts in 2015 and B starts in 2016, B will get 0 through the whole year 2015).
Here is the grouped dataframe:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 9.0 6.0 560.0 2736.0
2015-01-19 2.0 1.0 560.0 3312.0
2015-01-26 NaN 5.0 600.0 2196.0
2015-02-02 NaN NaN 40.0 3312.0
2015-02-16 7.0 6.0 520.0 5004.0
2015-02-23 12.0 4.0 480.0 4212.0
2015-04-13 11.0 6.0 920.0 4230.0
And here is the corresponding percentage change:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 -0.777778 -0.833333 0.000000 0.210526
2015-01-26 -0.500000 4.000000 0.071429 -0.336957
2015-02-02 0.000000 -0.800000 -0.933333 0.508197
2015-02-16 6.000000 5.000000 12.000000 0.510870
2015-02-23 0.714286 -0.333333 -0.076923 -0.158273
The factor 12 at 405659844106 is 'correct'.
Here is another example from my dataframe:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 2175.0 200.0 NaN NaN
2015-01-19 2550.0 NaN NaN NaN
2015-01-26 925.0 NaN NaN NaN
2015-02-02 675.0 NaN NaN NaN
2015-02-16 1400.0 200.0 120.0 NaN
2015-02-23 6125.0 320.0 NaN NaN
And the corresponding percentage change:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 0.172414 -0.995000 0.000000 -0.058824
2015-01-26 -0.637255 0.000000 0.000000 0.047794
2015-02-02 -0.270270 0.000000 0.000000 -0.996491
2015-02-16 1.074074 199.000000 119.000000 279.000000
2015-02-23 3.375000 0.600000 -0.991667 0.310714
As you can see, there are changes by a factor of 200-300 which come from a replaced NaN changing to a real value.
This data is used for k-means clustering, and such 'nonsense' data ruins my k-means centers.
Does anyone have an idea how to remove such columns?
I used the following statement to drop the nonsense columns:
max_nan_value_count = 5
data_extracted = data_extracted.drop(
    data_extracted.columns[data_extracted.apply(
        lambda col: col.isnull().sum() > max_nan_value_count)],
    axis=1)
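A shorter alternative with the same effect is dropna with a thresh; a sketch assuming it runs on data_extracted before the NaNs are filled with 1:
max_nan_value_count = 5

# keep only the columns with at least len(data_extracted) - max_nan_value_count
# non-NaN values, i.e. at most max_nan_value_count missing values
data_extracted = data_extracted.dropna(
    axis=1, thresh=len(data_extracted) - max_nan_value_count)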
Let us assume this is my DataFrame:
City State Country
Name
A NYC NaN NaN
B NaN NaN USA
C NYC NY NaN
D 601009 NaN NaN
E NYC AZ NaN
F 000001 NaN NaN
G NaN NaN NaN
How do I get hold of the rows that have NaNs in both State and Country?
I'm looking for the following output
City State Country
Name
A NYC NaN NaN
D 601009 NaN NaN
F 000001 NaN NaN
G NaN NaN NaN
Thanks a bunch!
Use isnull:
In [133]: wd[wd['Country'].isnull() & wd['State'].isnull()]
Out[133]:
City State Country
Name
A NYC NaN NaN
D 601009 NaN NaN
F 000001 NaN NaN
G NaN NaN NaN
or
In [135]: wd[wd[['State', 'Country']].isnull().all(axis=1)]
Out[135]:
City State Country
Name
A NYC NaN NaN
D 601009 NaN NaN
F 000001 NaN NaN
G NaN NaN NaN
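For a self-contained run, a sketch that rebuilds the example frame from the table above and applies the second form (isna is the modern alias for isnull):
import numpy as np
import pandas as pd

wd = pd.DataFrame(
    {'City': ['NYC', np.nan, 'NYC', '601009', 'NYC', '000001', np.nan],
     'State': [np.nan, np.nan, 'NY', np.nan, 'AZ', np.nan, np.nan],
     'Country': [np.nan, 'USA', np.nan, np.nan, np.nan, np.nan, np.nan]},
    index=pd.Index(list('ABCDEFG'), name='Name'))

# rows where both State and Country are missing
print(wd[wd[['State', 'Country']].isna().all(axis=1)])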
ID RT EZ Z0 Z1 Z2 RHO PHE
1889 UN NA 1.0000 0.0000 0.0000 0.8765 -1
1890 UN NA 1.0000 0.0000 0.0000 0.4567 -1
1891 UN NA 1.0000 0.0000 0.0000 0.0012 -1
1892 UN NA 1.0000 0.0000 0.0000 0.1011 -1
I would like to grep all the IDs that have a value of less than 0.2 in column 'RHO', with the other columns included for the selected rows.
Use awk directly by saying awk '$field < value':
$ awk '$7<0.2' file
1891 UN NA 1.0000 0.0000 0.0000 0.0012 -1
1892 UN NA 1.0000 0.0000 0.0000 0.1011 -1
As RHO is column 7, it checks that field.
In case you just want to print a specific column, say awk '$field < value {print $another_field}'. For the ID:
$ awk '$7<0.2 {print $1}' file
1891
1892
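If the file ends up in Python anyway, the same filter can also be written in pandas instead of awk; a sketch assuming a whitespace-separated file named 'file' with the header row shown above:
import pandas as pd

# whitespace-separated file with header row ID RT EZ Z0 Z1 Z2 RHO PHE
df = pd.read_csv('file', sep=r'\s+')

# all columns for the rows with RHO < 0.2
print(df[df['RHO'] < 0.2])

# only the IDs
print(df.loc[df['RHO'] < 0.2, 'ID'])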